nhkore 0.2.0 → 0.3.0

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
- metadata.gz: 6ab82aafdbc996ca3f0f010d533adb165df24d63a4799ba8812596551506d52c
- data.tar.gz: 5cb8b107928f7ba4c3e0100d70748b616f897a1b0e8c70149fdbf3ce09c39bd7
+ metadata.gz: a02c041aff2b0b040ff00acaeeaa54506e574e9575613b48c4dd75ef0ef45564
+ data.tar.gz: dd3f570c0e7b7223c039d4119989e04514ef0d4bf4fb53485397271021f39246
  SHA512:
- metadata.gz: 2e85f11cb8b88605964e656234c746adb514d947fa46523fe464e601ae87cc1cdb7f8f32407e317c7e0c470cdfefde24251b7704cf014399a9eeb300fbd43936
- data.tar.gz: 6b6aecea79efcf9f936667aa2d6a60b7255ee49de6929a576db468504e5084b254cb64729f8638de2c1814cc1223cd9a8ed04703d1a737ff805b3b2a5566102b
+ metadata.gz: 4f5021ab1fd74bb1c5a42574fa1045f71069b6f8ab6cf7b1717e6164505127e6c657f2e36be903dc190d356bed83fdf8c2de4c89644c7676863cfb9a8c53da8f
+ data.tar.gz: e082a6ed70bacccb763386e00d8ca92351d4ee8d9f2d32a9b79dc6a2733ea46cd739e95550124e4807281c58f3a65faf9b8496740a51cca13e068ecd3e882d3a
data/CHANGELOG.md CHANGED
@@ -2,7 +2,41 @@

  Format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/).

- ## [[Unreleased]](https://github.com/esotericpig/nhkore/compare/v0.2.0...master)
+ ## [[Unreleased]](https://github.com/esotericpig/nhkore/compare/v0.3.0...master)
+
+ ## [v0.3.0] - 2020-04-12
+
+ ### Added
+ - UserAgents
+   - Tons of random `User-Agent` strings for `Scraper`.
+
+ ### Changed
+ - BingCmd => SearchCmd
+   - Major (breaking) change.
+   - Changed `$ nhkore bing easy` to:
+     - `$ nhkore search easy bing`
+     - `$ nhkore se ez b`
+ - App
+   - Added options:
+     - `--color` (force color output for demos)
+     - `--user-agent` (specify a custom HTTP header field `User-Agent`)
+   - If `out_dir` is empty, don't prompt if okay to overwrite.
+ - README/nhkore.gemspec
+   - Added more info.
+   - Changed description.
+
+ ### Fixed
+ - Scraper/BingScraper
+   - Big fix.
+   - Fixed to get around Bing's strictness.
+     - Use a random `User-Agent` from `UserAgents`.
+     - Set HTTP header field `cookie` from `set-cookie` response.
+       - Added `http-cookie` gem.
+     - Use RSS as a fallback.
+ - GetCmd
+   - When extracting files...
+     - ignore empty filenames in the Zip for safety.
+     - ask to overwrite files instead of erroring.

  ## [v0.2.0] - 2020-04-01
  First working version.
data/README.md CHANGED
@@ -10,20 +10,39 @@ A CLI app that scrapes [NHK News Web Easy](https://www3.nhk.or.jp/news/easy/) to

  This is similar to a [core word/vocabulary list](https://www.fluentin3months.com/core-japanese-words/), hence the name NHKore.

- In the future, I would like to add the regular NHK News, using the links from the easy versions.
+ [![asciinema Demo - Help](https://asciinema.org/a/MQTJ9vxcpB7VYAKzke7m4QM7P.png)](https://asciinema.org/a/MQTJ9vxcpB7VYAKzke7m4QM7P?speed=2)

  ## Contents

- - [Installing](#installing)
- - [Using](#using)
- - [Hacking](#hacking)
- - [License](#license)
+ - [For Non-Power Users](#for-non-power-users-)
+ - [Installing](#installing-)
+ - [Using](#using-)
+   - [The Basics](#the-basics-)
+   - [Unlimited Power!](#unlimited-power-)
+     - [Get Command](#get-command-)
+     - [Sift Command](#sift-command-)
+   - [Sakura Fields Forever](#sakura-fields-forever-)
+     - [Search Command](#search-command-)
+     - [News Command](#news-command-)
+ - [Using the Library](#using-the-library-)
+ - [Hacking](#hacking-)
+ - [License](#license-)

- ## [Installing](#contents)
+ ## For Non-Power Users [^](#contents)
+
+ If you're a non-Power User, you're probably just interested in the data.
+
+ [Click here](https://esotericpig.github.io/showcase/nhkore-ez.html) for a big HTML file of the final result from all of the current articles scraped.
+
+ [Click here](https://github.com/esotericpig/nhkore/releases/latest) to go to the latest release and download `nhkore-core.zip` from the `Assets`. It contains all of the links scraped, all of the data scraped per article, and a final CSV file.
+
+ If you'd like to try using the app, please download and install [Ruby](https://www.ruby-lang.org/en/downloads/) and then follow the instructions below. You'll need to be able to use the command line.
+
+ ## Installing [^](#contents)

  Pick your poison...

- With the RubyGems CLI package manager:
+ With the RubyGems package manager:

  `$ gem install nhkore`

@@ -32,14 +51,267 @@ Manually:
  ```
  $ git clone 'https://github.com/esotericpig/nhkore.git'
  $ cd nhkore
- $ rake install
+ $ gem build nhkore.gemspec
+ $ gem install *.gem
  ```

- ## [Using](#contents)
+ If there are errors running `nhkore`, you may also need to [install Nokogiri](https://nokogiri.org/tutorials/installing_nokogiri.html) manually; it is used for scraping HTML.
+
+ ## Using [^](#contents)
+
+ ### The Basics [^](#contents)

- TODO: update README Using section
+ The most useful thing to do is to simply scrape one article and then study the most frequent words before reading that article.

- ## [Hacking](#contents)
+ First, scrape the article:
+
+ `$ nhkore news easy -u 'https://www3.nhk.or.jp/news/easy/k10011862381000/k10011862381000.html'`
+
+ If your internet is slow, there are several global options to help alleviate your internet woes, which can be used with any sub command:
+
+ ```
+ -m --max-retry=<value>     maximum number of times to retry URLs
+                            (-1 or integer >= 0) (default: 3)
+ -o --open-timeout=<value>  seconds for URL open timeouts
+                            (-1 or decimal >= 0)
+ -r --read-timeout=<value>  seconds for URL read timeouts
+                            (-1 or decimal >= 0)
+ -t --timeout=<value>       seconds for all URL timeouts: [open, read]
+                            (-1 or decimal >= 0)
+ ```
+
+ Example usage:
+
+ `$ nhkore -t 300 -m 10 news easy -u 'https://www3.nhk.or.jp/news/easy/k10011862381000/k10011862381000.html'`
+
+ Some older articles will fail to scrape and need additional options (this is very rare):
+
+ ```
+ -D --no-dict           do not try to parse the dictionary files
+                        for the articles; useful in case of errors
+                        trying to load the dictionaries (or for offline testing)
+ -L --lenient           leniently (not strict) scrape articles:
+                        body & title content without the proper
+                        HTML/CSS classes/IDs and no futsuurl;
+                        example URLs:
+                        - https://www3.nhk.or.jp/news/easy/article/disaster_earthquake_02.html
+                        - https://www3.nhk.or.jp/news/easy/tsunamikeihou/index.html
+ -M --missingno         very rarely an article will not have kana or kanji
+                        for a Ruby tag; to not raise an error, this will
+                        use previously scraped data to fill it in;
+                        example URL:
+                        - https://www3.nhk.or.jp/news/easy/k10012331311000/k10012331311000.html
+ -d --datetime=<value>  date time to use as a fallback in cases
+                        when an article doesn't have one;
+                        format: YYYY-mm-dd H:M; example: 2020-03-30 15:30
+ ```
+
+ Example usage:
+
+ `$ nhkore -t 300 -m 10 news -D -L -M -d '2011-03-07 06:30' easy -u 'https://www3.nhk.or.jp/news/easy/tsunamikeihou/index.html'`
+
+ Now that the data from the article has been scraped, you can generate a CSV/HTML/YAML file of the words ordered by frequency:
+
+ ```
+ $ nhkore sift easy -e csv
+ $ nhkore sift easy -e html
+ $ nhkore sift easy -e yml
+ ```
+
+ If you have other scraped articles, then you'll need to filter down to the specific one:
+
+ | Command | Description |
+ | --- | --- |
+ | `$ nhkore sift easy -u k10011862381000` | Filter by URL |
+ | `$ nhkore sift easy -t '植えられた桜'` | Filter by title |
+ | `$ nhkore sift easy -d '2019-3-29 11:30'` | Filter by date time |
+ | `$ nhkore sift easy -d '2019-3-29' -t '桜'` | Filter by date time &amp; title |
+ | `$ nhkore sift easy -d '2019-3-29' -t '桜' -e html` | Filter &amp; output HTML |
+ | `$ nhkore sift easy -d '2019-3-29' -t '桜' -o 'sakura.html'` | Filter &amp; output HTML |
+
+ Complete demo:
+
+ [![asciinema Demo - The Basics](https://asciinema.org/a/316571.png)](https://asciinema.org/a/316571)
+
+ ### Unlimited Power! [^](#contents)
+
+ #### Get Command [^](#contents)
+
+ The `get` command will download and extract `nhkore-core.zip` from the [latest release](https://github.com/esotericpig/nhkore/releases/latest) for you.
+
+ This already has tons of articles scraped so that you don't have to re-scrape them. Then, for example, you can easily create a CSV file from all of `2019` or all of `December 2019`.
+
+ Example usage:
+
+ `$ nhkore get`
+
+ By default, it will extract the data to `./core/`. You can change this:
+
+ `$ nhkore get -o 'my dir/'`
+
+ Complete demo:
+
+ [![asciinema Demo - Get](https://asciinema.org/a/317773.png)](https://asciinema.org/a/317773)
+
+ #### Sift Command [^](#contents)
+
+ After obtaining the scraped data, you can `sift` all of the data (or select data) into one of these file formats:
+
+ | Format | Typical Purpose |
+ | --- | --- |
+ | CSV | For uploading to a flashcard website (e.g., Memrise, Anki, Buffl) after changing the data appropriately. |
+ | HTML | For comfortable viewing in a web browser or for sharing. |
+ | YAML | For developers to automatically add translations or to manipulate the data in some other way programmatically. |
+
+ The data is sorted by frequency in descending order (i.e., most frequent words first).
+
+ If you wish to sort/arrange the data in some other way, CSV editors (e.g., LibreOffice, WPS Office, Microsoft Office) can do this easily and efficiently, or if you are code-savvy, you can programmatically manipulate the CSV/YAML/HTML file.
+
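For example, here is a minimal Ruby sketch of that kind of programmatic manipulation using only the standard `csv` library; the filename and the `word`/`eng` column names are assumptions, so check the header row of your actual sifted file first:

```Ruby
# Hypothetical example: fill in a blank English translation in a sifted CSV.
# 'core/sift_nhk_news_web_easy.csv' and the 'word'/'eng' column names are
# assumptions -- adjust them to match the file that `nhkore sift` produced.
require 'csv'

csv_file = 'core/sift_nhk_news_web_easy.csv'
table = CSV.read(csv_file, headers: true)

table.each do |row|
  row['eng'] = 'cherry blossom' if row['word'] == '桜' && row['eng'].to_s.empty?
end

File.write(csv_file, table.to_csv)
```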
+ The defaults will sift all of the data into a CSV file, which may not be what you want:
+
+ `$ nhkore sift easy`
+
+ You can filter the data by using different options:
+
+ ```
+ -d --datetime=<value>  date time to filter on; examples:
+                        - '2020-7-1 13:10...2020-7-31 11:11'
+                        - '2020-12' (2020, December 1st-31st)
+                        - '7-4...7-9' (July 4th-9th of Current Year)
+                        - '7-9' (July 9th of Current Year)
+                        - '9' (9th of Current Year & Month)
+ -t --title=<value>     title to filter on, where search text only
+                        needs to be somewhere in the title
+ -u --url=<value>       URL to filter on, where search text only
+                        needs to be somewhere in the URL
+ ```
+
+ Filter examples:
+
+ ```
+ $ nhkore sift easy -d 2019
+ $ nhkore sift easy -d '2019-12'
+ $ nhkore sift easy -d '2019-7-4...9' # July 4th to 9th of 2019
+ $ nhkore sift easy -d '2019-12-25 13:10'
+ $ nhkore sift easy -t 'マリオ'
+ $ nhkore sift easy -u 'k10011862381000'
+ ```
+
+ You can save the data to a different format using one of these options:
+
+ ```
+ -e --ext=<value>  type of file (extension) to save;
+                   valid options: [csv, htm, html, yaml, yml];
+                   not needed if you specify a file extension with
+                   the '--out' option: '--out sift.html'
+                   (default: csv)
+ -o --out=<value>  'directory/file' to save sifted data to;
+                   if you only specify a directory or a file, it will
+                   attach the appropriate default directory/file name
+                   (defaults:
+                   core/sift_nhk_news_web_easy{search.criteria}{file.ext},
+                   core/sift_nhk_news_web_regular{search.criteria}{file.ext})
+ ```
+
+ Format examples:
+
+ ```
+ $ nhkore sift easy -e html
+ $ nhkore sift easy -e yml
+ $ nhkore sift easy -o 'mario.html'
+ $ nhkore sift easy -o 'sakura.yml'
+ ```
+
+ Lastly, you can ignore certain columns from the output. Definitions can be quite long, and English translations are currently always blank (meant to be filled in manually/programmatically).
+
+ ```
+ -D --no-defn  do not output the definitions for words
+               (which can be quite long)
+ -E --no-eng   do not output the English translations for words
+ ```
+
+ Complete demo:
+
+ [![asciinema Demo - Sift](https://asciinema.org/a/318119.png)](https://asciinema.org/a/318119)
+
+ ### Sakura Fields Forever [^](#contents)
+
+ #### Search Command [^](#contents)
+
+ The [news](#news-command-) command (for scraping articles) relies on having a file of article links.
+
+ Currently, the NHK website doesn't provide an historical record of all of its articles, and it's up to the user to find them.
+
+ The format of the file is simple, so you can edit it by hand (or programmatically) very easily:
+
+ ```YAML
+ # core/links_nhk_news_web_easy.yml
+ ---
+ links:
+   https://www3.nhk.or.jp/news/easy/k10012323711000/k10012323711000.html:
+     url: https://www3.nhk.or.jp/news/easy/k10012323711000/k10012323711000.html
+     scraped: false
+   https://www3.nhk.or.jp/news/easy/k10012321401000/k10012321401000.html:
+     url: https://www3.nhk.or.jp/news/easy/k10012321401000/k10012321401000.html
+     scraped: false
+ ```
+
+ Only the key (which is the URL) and the `url` field are required. The rest of the fields will be populated when you scrape the data.
+
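For instance, appending a link programmatically only takes a few lines with Ruby's standard `yaml` library; this is a rough sketch based on the layout above (not code from the gem), using the `core/links_nhk_news_web_easy.yml` path shown in the example:

```Ruby
# Hypothetical example: append a new article link to the links file,
# writing only the URL key and the 'url'/'scraped' fields shown above.
require 'yaml'

file = 'core/links_nhk_news_web_easy.yml'
url  = 'https://www3.nhk.or.jp/news/easy/k10012323711000/k10012323711000.html'

data = File.exist?(file) ? (YAML.load_file(file) || {}) : {}
(data['links'] ||= {})[url] ||= {'url' => url, 'scraped' => false}

File.write(file, data.to_yaml)
```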
+ > &lt;rambling&gt;
+ > Originally, I was planning on using a different key, so that's why the URL is duplicated. This also allows for a possible future breaking version (major version change) to alter the key. In addition, I was originally planning to allow filtering in this file, so that's why additional fields are populated after scraping the data.
+ > &lt;/rambling&gt;
+
+ Example after running the `news` command:
+
+ ```YAML
+ # core/links_nhk_news_web_easy.yml
+ # - After being scraped
+ ---
+ links:
+   https://www3.nhk.or.jp/news/easy/k10012323711000/k10012323711000.html:
+     url: https://www3.nhk.or.jp/news/easy/k10012323711000/k10012323711000.html
+     scraped: true
+     datetime: '2020-03-11T16:00:00+09:00'
+     title: 安倍総理大臣「今月20日ごろまで大きなイベントをしないで」
+     futsuurl: https://www3.nhk.or.jp/news/html/20200310/k10012323711000.html
+     sha256: d1186ebbc2013564e52f21a2e8ecd56144ed5fe98c365f6edbd4eefb2db345eb
+   https://www3.nhk.or.jp/news/easy/k10012321401000/k10012321401000.html:
+     url: https://www3.nhk.or.jp/news/easy/k10012321401000/k10012321401000.html
+     scraped: true
+     datetime: '2020-03-11T11:30:00+09:00'
+     title: 島根県の会社 中国から技能実習生が来なくて困っている
+     futsuurl: https://www3.nhk.or.jp/news/html/20200309/k10012321401000.html
+     sha256: 2df91884fbbafdc69bc3126cb0cb7b63b2c24e85bc0de707643919e4581927a9
+ ```
+
+ If you don't wish to edit this file by hand (or programmatically), that's where the `search` command comes into play.
+
+ #### News Command [^](#contents)
+
+ ## Using the Library [^](#contents)
+
+ ### Setup
+
+ Pick your poison...
+
+ In your *Gemspec* (*&lt;project&gt;.gemspec*):
+
+ ```Ruby
+ spec.add_runtime_dependency 'nhkore', '~> X.X'
+ ```
+
+ In your *Gemfile*:
+
+ ```Ruby
+ # Pick one...
+ gem 'nhkore', '~> X.X'
+ gem 'nhkore', :git => 'https://github.com/esotericpig/nhkore.git', :tag => 'vX.X'
+ ```
+
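As a quick smoke test after setup, a minimal sketch (it only assumes the top-level `nhkore` require and the `NHKore::VERSION` constant, which the Rakefile below also references):

```Ruby
# Hypothetical example: confirm the gem is installed and loads.
require 'nhkore'
require 'nhkore/version'

puts NHKore::VERSION
```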
+ ### Scraper
+
+ ## Hacking [^](#contents)

  ```
  $ git clone 'https://github.com/esotericpig/nhkore.git'
@@ -48,19 +320,35 @@ $ bundle install
  $ bundle exec rake -T
  ```

- ### Testing
+ Install Nokogiri:

  ```
- $ bundle exec rake test
+ $ bundle exec rake nokogiri_apt # Ubuntu/Debian
+ $ bundle exec rake nokogiri_dnf # Fedora/CentOS/Red Hat
+ $ bundle exec rake nokogiri_other # macOS, Windows, etc.
  ```

+ ### Running
+
+ `$ ruby -w lib/nhkore.rb`
+
+ ### Testing
+
+ `$ bundle exec rake test`
+
  ### Generating Doc

- ```
- $ bundle exec rake doc
- ```
+ `$ bundle exec rake doc`
+
+ ### Installing Locally (without Network Access)
+
+ `$ bundle exec rake install:local`
+
+ ### Releasing/Publishing
+
+ `$ bundle exec rake release`

- ## [License](#contents)
+ ## License [^](#contents)

  [GNU LGPL v3+](LICENSE.txt)

data/Rakefile CHANGED
@@ -26,14 +26,13 @@ require 'rake/clean'
  require 'rake/testtask'
  require 'raketeer/irb'
  require 'raketeer/nokogiri_installs'
- require 'raketeer/run'
  require 'yard'
  require 'yard_ghurt'

+ require 'nhkore/util'
  require 'nhkore/version'


- CORE_PKG_DIR = 'core_pkg'
  PKG_DIR = 'pkg'

  CLEAN.exclude('.git/','stock/')
@@ -46,16 +45,14 @@ desc 'Generate documentation (YARDoc)'
  task :doc => [:yard,:yard_gfm_fix] do |task|
  end

- desc "Package '#{File.join(CORE_PKG_DIR,'')}' data as a Zip file into '#{File.join(PKG_DIR,'')}'"
+ desc "Package '#{File.join(NHKore::Util::CORE_DIR,'')}' data as a Zip file into '#{File.join(PKG_DIR,'')}'"
  task :pkg_core do |task|
    mkdir_p PKG_DIR

-   cd CORE_PKG_DIR do
-     pattern = File.join('core','*.{csv,html,yml}')
-     zip_file = File.join('..',PKG_DIR,'nhkore-core.zip')
-
-     sh 'zip','-9rv',zip_file,*Dir.glob(pattern).sort()
-   end
+   pattern = File.join(NHKore::Util::CORE_DIR,'*.{csv,html,yml}')
+   zip_file = File.join(PKG_DIR,'nhkore-core.zip')
+
+   sh 'zip','-9rv',zip_file,*Dir.glob(pattern).sort()
  end

  Rake::TestTask.new() do |task|
@@ -77,8 +74,8 @@ YARD::Rake::YardocTask.new() do |task|
    task.options += ['--title',"NHKore v#{NHKore::VERSION} Doc"]
  end

- # Execute "yard_gfm_fix" for production.
- # Execute "yard_gfm_fix[true]" for testing locally.
+ # Execute "rake yard_gfm_fix" for production.
+ # Execute "rake yard_gfm_fix[true]" for testing locally.
  YardGhurt::GFMFixTask.new() do |task|
    task.arg_names = [:dev]
    task.dry_run = false
@@ -86,10 +83,10 @@ YardGhurt::GFMFixTask.new() do |task|
    task.md_files = ['index.html']

    task.before = Proc.new() do |task,args|
-     # Delete this file as it's never used (index.html is an exact copy)
+     # Delete this file as it's never used (index.html is an exact copy).
      YardGhurt::Util.rm_exist(File.join(task.doc_dir,'file.README.html'))

-     # Root dir of my GitHub Page for CSS/JS
+     # Root dir of my GitHub Page for CSS/JS.
      GHP_ROOT = YardGhurt::Util.to_bool(args.dev) ? '../../esotericpig.github.io' : '../../..'

      task.css_styles << %Q(<link rel="stylesheet" type="text/css" href="#{GHP_ROOT}/css/prism.css" />)