nhkore 0.2.0 → 0.3.0

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
- metadata.gz: 6ab82aafdbc996ca3f0f010d533adb165df24d63a4799ba8812596551506d52c
- data.tar.gz: 5cb8b107928f7ba4c3e0100d70748b616f897a1b0e8c70149fdbf3ce09c39bd7
+ metadata.gz: a02c041aff2b0b040ff00acaeeaa54506e574e9575613b48c4dd75ef0ef45564
+ data.tar.gz: dd3f570c0e7b7223c039d4119989e04514ef0d4bf4fb53485397271021f39246
  SHA512:
- metadata.gz: 2e85f11cb8b88605964e656234c746adb514d947fa46523fe464e601ae87cc1cdb7f8f32407e317c7e0c470cdfefde24251b7704cf014399a9eeb300fbd43936
- data.tar.gz: 6b6aecea79efcf9f936667aa2d6a60b7255ee49de6929a576db468504e5084b254cb64729f8638de2c1814cc1223cd9a8ed04703d1a737ff805b3b2a5566102b
+ metadata.gz: 4f5021ab1fd74bb1c5a42574fa1045f71069b6f8ab6cf7b1717e6164505127e6c657f2e36be903dc190d356bed83fdf8c2de4c89644c7676863cfb9a8c53da8f
+ data.tar.gz: e082a6ed70bacccb763386e00d8ca92351d4ee8d9f2d32a9b79dc6a2733ea46cd739e95550124e4807281c58f3a65faf9b8496740a51cca13e068ecd3e882d3a
data/CHANGELOG.md CHANGED
@@ -2,7 +2,41 @@

  Format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/).

- ## [[Unreleased]](https://github.com/esotericpig/nhkore/compare/v0.2.0...master)
+ ## [[Unreleased]](https://github.com/esotericpig/nhkore/compare/v0.3.0...master)
+
+ ## [v0.3.0] - 2020-04-12
+
+ ### Added
+ - UserAgents
+   - Tons of random `User-Agent` strings for `Scraper`.
+
+ ### Changed
+ - BingCmd => SearchCmd
+   - Major (breaking) change.
+   - Changed `$ nhkore bing easy` to:
+     - `$ nhkore search easy bing`
+     - `$ nhkore se ez b`
+ - App
+   - Added options:
+     - `--color` (force color output for demos)
+     - `--user-agent` (specify a custom HTTP header field `User-Agent`)
+   - If `out_dir` is empty, don't prompt if okay to overwrite.
+ - README/nhkore.gemspec
+   - Added more info.
+   - Changed description.
+
+ ### Fixed
+ - Scraper/BingScraper
+   - Big fix.
+   - Fixed to get around Bing's strictness.
+     - Use a random `User-Agent` from `UserAgents`.
+     - Set HTTP header field `cookie` from `set-cookie` response.
+       - Added `http-cookie` gem.
+     - Use RSS as a fallback.
+ - GetCmd
+   - When extracting files...
+     - ignore empty filenames in the Zip for safety.
+     - ask to overwrite files instead of erroring.

  ## [v0.2.0] - 2020-04-01
  First working version.
data/README.md CHANGED
@@ -10,20 +10,39 @@ A CLI app that scrapes [NHK News Web Easy](https://www3.nhk.or.jp/news/easy/) to

  This is similar to a [core word/vocabulary list](https://www.fluentin3months.com/core-japanese-words/), hence the name NHKore.

- In the future, I would like to add the regular NHK News, using the links from the easy versions.
+ [![asciinema Demo - Help](https://asciinema.org/a/MQTJ9vxcpB7VYAKzke7m4QM7P.png)](https://asciinema.org/a/MQTJ9vxcpB7VYAKzke7m4QM7P?speed=2)

  ## Contents

- - [Installing](#installing)
- - [Using](#using)
- - [Hacking](#hacking)
- - [License](#license)
+ - [For Non-Power Users](#for-non-power-users-)
+ - [Installing](#installing-)
+ - [Using](#using-)
+   - [The Basics](#the-basics-)
+   - [Unlimited Power!](#unlimited-power-)
+     - [Get Command](#get-command-)
+     - [Sift Command](#sift-command-)
+   - [Sakura Fields Forever](#sakura-fields-forever-)
+     - [Search Command](#search-command-)
+     - [News Command](#news-command-)
+ - [Using the Library](#using-the-library-)
+ - [Hacking](#hacking-)
+ - [License](#license-)

- ## [Installing](#contents)
+ ## For Non-Power Users [^](#contents)
+
+ If you're a non-Power User, you're probably just interested in the data.
+
+ [Click here](https://esotericpig.github.io/showcase/nhkore-ez.html) for a big HTML file of the final result from all of the current articles scraped.
+
+ [Click here](https://github.com/esotericpig/nhkore/releases/latest) to go to the latest release and download `nhkore-core.zip` from the `Assets`. It contains all of the links scraped, all of the data scraped per article, and a final CSV file.
+
+ If you'd like to try using the app, please download and install [Ruby](https://www.ruby-lang.org/en/downloads/) and then follow the instructions below. You'll need to be able to use the command line.
+
+ ## Installing [^](#contents)

  Pick your poison...

- With the RubyGems CLI package manager:
+ With the RubyGems package manager:

  `$ gem install nhkore`

@@ -32,14 +51,267 @@ Manually:
  ```
  $ git clone 'https://github.com/esotericpig/nhkore.git'
  $ cd nhkore
- $ rake install
+ $ gem build nhkore.gemspec
+ $ gem install *.gem
  ```

- ## [Using](#contents)
+ If there are errors running `nhkore`, you may also need to [install Nokogiri](https://nokogiri.org/tutorials/installing_nokogiri.html) manually; it is used for scraping HTML.
+
+ ## Using [^](#contents)
+
+ ### The Basics [^](#contents)

- TODO: update README Using section
+ The most useful thing to do is to simply scrape one article and then study the most frequent words before reading that article.

- ## [Hacking](#contents)
+ First, scrape the article:
+
+ `$ nhkore news easy -u 'https://www3.nhk.or.jp/news/easy/k10011862381000/k10011862381000.html'`
+
+ If your internet is slow, there are several global options to help alleviate your internet woes, which can be used with any sub command:
+
+ ```
+ -m --max-retry=<value>     maximum number of times to retry URLs
+                            (-1 or integer >= 0) (default: 3)
+ -o --open-timeout=<value>  seconds for URL open timeouts
+                            (-1 or decimal >= 0)
+ -r --read-timeout=<value>  seconds for URL read timeouts
+                            (-1 or decimal >= 0)
+ -t --timeout=<value>       seconds for all URL timeouts: [open, read]
+                            (-1 or decimal >= 0)
+ ```
+
+ Example usage:
+
+ `$ nhkore -t 300 -m 10 news easy -u 'https://www3.nhk.or.jp/news/easy/k10011862381000/k10011862381000.html'`
+
+ Some older articles will fail to scrape and need additional options (this is very rare):
+
+ ```
+ -D --no-dict           do not try to parse the dictionary files
+                        for the articles; useful in case of errors
+                        trying to load the dictionaries (or for offline testing)
+ -L --lenient           leniently (not strict) scrape articles:
+                        body & title content without the proper
+                        HTML/CSS classes/IDs and no futsuurl;
+                        example URLs:
+                        - https://www3.nhk.or.jp/news/easy/article/disaster_earthquake_02.html
+                        - https://www3.nhk.or.jp/news/easy/tsunamikeihou/index.html
+ -M --missingno         very rarely an article will not have kana or kanji
+                        for a Ruby tag; to not raise an error, this will
+                        use previously scraped data to fill it in;
+                        example URL:
+                        - https://www3.nhk.or.jp/news/easy/k10012331311000/k10012331311000.html
+ -d --datetime=<value>  date time to use as a fallback in cases
+                        when an article doesn't have one;
+                        format: YYYY-mm-dd H:M; example: 2020-03-30 15:30
+ ```
+
+ Example usage:
+
+ `$ nhkore -t 300 -m 10 news -D -L -M -d '2011-03-07 06:30' easy -u 'https://www3.nhk.or.jp/news/easy/tsunamikeihou/index.html'`
+
+ Now that the data from the article has been scraped, you can generate a CSV/HTML/YAML file of the words ordered by frequency:
+
+ ```
+ $ nhkore sift easy -e csv
+ $ nhkore sift easy -e html
+ $ nhkore sift easy -e yml
+ ```
+
+ If you have other scraped articles, then you'll need to filter down to the specific one:
+
+ | Command | Description |
+ | --- | --- |
+ | `$ nhkore sift easy -u k10011862381000` | Filter by URL |
+ | `$ nhkore sift easy -t '植えられた桜'` | Filter by title |
+ | `$ nhkore sift easy -d '2019-3-29 11:30'` | Filter by date time |
+ | `$ nhkore sift easy -d '2019-3-29' -t '桜'` | Filter by date time &amp; title |
+ | `$ nhkore sift easy -d '2019-3-29' -t '桜' -e html` | Filter &amp; output HTML |
+ | `$ nhkore sift easy -d '2019-3-29' -t '桜' -o 'sakura.html'` | Filter &amp; output HTML |
+
+ Complete demo:
+
+ [![asciinema Demo - The Basics](https://asciinema.org/a/316571.png)](https://asciinema.org/a/316571)
+
+ ### Unlimited Power! [^](#contents)
+
+ #### Get Command [^](#contents)
+
+ The `get` command will download and extract `nhkore-core.zip` from the [latest release](https://github.com/esotericpig/nhkore/releases/latest) for you.
+
+ This already has tons of articles scraped so that you don't have to re-scrape them. Then, for example, you can easily create a CSV file from all of `2019` or all of `December 2019`.
+
+ Example usage:
+
+ `$ nhkore get`
+
+ By default, it will extract the data to `./core/`. You can change this:
+
+ `$ nhkore get -o 'my dir/'`
+
+ Complete demo:
+
+ [![asciinema Demo - Get](https://asciinema.org/a/317773.png)](https://asciinema.org/a/317773)
+
+ #### Sift Command [^](#contents)
+
+ After obtaining the scraped data, you can `sift` all of the data (or select data) into one of these file formats:
+
+ | Format | Typical Purpose |
+ | --- | --- |
+ | CSV | For uploading to a flashcard website (e.g., Memrise, Anki, Buffl) after changing the data appropriately. |
+ | HTML | For comfortable viewing in a web browser or for sharing. |
+ | YAML | For developers to automatically add translations or to manipulate the data in some other way programmatically. |
+
+ The data is sorted by frequency in descending order (i.e., most frequent words first).
+
+ If you wish to sort/arrange the data in some other way, CSV editors (e.g., LibreOffice, WPS Office, Microsoft Office) can do this easily and efficiently, or if you are code-savvy, you can programmatically manipulate the CSV/YAML/HTML file.
+
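For example, here is a minimal Ruby sketch of that kind of programmatic manipulation using only the standard `csv` library; the filename and the `word`/`eng` column names are assumptions, so check the header row of your actual sifted file first:

```Ruby
# Hypothetical example: fill in a blank English translation in a sifted CSV.
# 'core/sift_nhk_news_web_easy.csv' and the 'word'/'eng' column names are
# assumptions -- adjust them to match the file that `nhkore sift` produced.
require 'csv'

csv_file = 'core/sift_nhk_news_web_easy.csv'
table = CSV.read(csv_file, headers: true)

table.each do |row|
  row['eng'] = 'cherry blossom' if row['word'] == '桜' && row['eng'].to_s.empty?
end

File.write(csv_file, table.to_csv)
```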
+ The defaults will sift all of the data into a CSV file, which may not be what you want:
+
+ `$ nhkore sift easy`
+
+ You can filter the data by using different options:
+
+ ```
+ -d --datetime=<value>  date time to filter on; examples:
+                        - '2020-7-1 13:10...2020-7-31 11:11'
+                        - '2020-12' (2020, December 1st-31st)
+                        - '7-4...7-9' (July 4th-9th of Current Year)
+                        - '7-9' (July 9th of Current Year)
+                        - '9' (9th of Current Year & Month)
+ -t --title=<value>     title to filter on, where search text only
+                        needs to be somewhere in the title
+ -u --url=<value>       URL to filter on, where search text only
+                        needs to be somewhere in the URL
+ ```
+
+ Filter examples:
+
+ ```
+ $ nhkore sift easy -d 2019
+ $ nhkore sift easy -d '2019-12'
+ $ nhkore sift easy -d '2019-7-4...9' # July 4th to 9th of 2019
+ $ nhkore sift easy -d '2019-12-25 13:10'
+ $ nhkore sift easy -t 'マリオ'
+ $ nhkore sift easy -u 'k10011862381000'
+ ```
+
+ You can save the data to a different format using one of these options:
+
+ ```
+ -e --ext=<value>  type of file (extension) to save;
+                   valid options: [csv, htm, html, yaml, yml];
+                   not needed if you specify a file extension with
+                   the '--out' option: '--out sift.html'
+                   (default: csv)
+ -o --out=<value>  'directory/file' to save sifted data to;
+                   if you only specify a directory or a file, it will
+                   attach the appropriate default directory/file name
+                   (defaults:
+                   core/sift_nhk_news_web_easy{search.criteria}{file.ext},
+                   core/sift_nhk_news_web_regular{search.criteria}{file.ext})
+ ```
+
+ Format examples:
+
+ ```
+ $ nhkore sift easy -e html
+ $ nhkore sift easy -e yml
+ $ nhkore sift easy -o 'mario.html'
+ $ nhkore sift easy -o 'sakura.yml'
+ ```
+
+ Lastly, you can ignore certain columns from the output. Definitions can be quite long, and English translations are currently always blank (meant to be filled in manually/programmatically).
+
+ ```
+ -D --no-defn  do not output the definitions for words
+               (which can be quite long)
+ -E --no-eng   do not output the English translations for words
+ ```
+
+ Complete demo:
+
+ [![asciinema Demo - Sift](https://asciinema.org/a/318119.png)](https://asciinema.org/a/318119)
+
+ ### Sakura Fields Forever [^](#contents)
+
+ #### Search Command [^](#contents)
+
+ The [news](#news-command-) command (for scraping articles) relies on having a file of article links.
+
+ Currently, the NHK website doesn't provide an historical record of all of its articles, and it's up to the user to find them.
+
+ The format of the file is simple, so you can edit it by hand (or programmatically) very easily:
+
+ ```YAML
+ # core/links_nhk_news_web_easy.yml
+ ---
+ links:
+   https://www3.nhk.or.jp/news/easy/k10012323711000/k10012323711000.html:
+     url: https://www3.nhk.or.jp/news/easy/k10012323711000/k10012323711000.html
+     scraped: false
+   https://www3.nhk.or.jp/news/easy/k10012321401000/k10012321401000.html:
+     url: https://www3.nhk.or.jp/news/easy/k10012321401000/k10012321401000.html
+     scraped: false
+ ```
+
+ Only the key (which is the URL) and the `url` field are required. The rest of the fields will be populated when you scrape the data.
+
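For instance, appending a link programmatically only takes a few lines with Ruby's standard `yaml` library; this is a rough sketch based on the layout above (not code from the gem), using the `core/links_nhk_news_web_easy.yml` path shown in the example:

```Ruby
# Hypothetical example: append a new article link to the links file,
# writing only the URL key and the 'url'/'scraped' fields shown above.
require 'yaml'

file = 'core/links_nhk_news_web_easy.yml'
url  = 'https://www3.nhk.or.jp/news/easy/k10012323711000/k10012323711000.html'

data = File.exist?(file) ? (YAML.load_file(file) || {}) : {}
(data['links'] ||= {})[url] ||= {'url' => url, 'scraped' => false}

File.write(file, data.to_yaml)
```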
+ > &lt;rambling&gt;
+ > Originally, I was planning on using a different key, so that's why the URL is duplicated. This also allows for a possible future breaking version (major version change) to alter the key. In addition, I was originally planning to allow filtering in this file, so that's why additional fields are populated after scraping the data.
+ > &lt;/rambling&gt;
+
+ Example after running the `news` command:
+
+ ```YAML
+ # core/links_nhk_news_web_easy.yml
+ # - After being scraped
+ ---
+ links:
+   https://www3.nhk.or.jp/news/easy/k10012323711000/k10012323711000.html:
+     url: https://www3.nhk.or.jp/news/easy/k10012323711000/k10012323711000.html
+     scraped: true
+     datetime: '2020-03-11T16:00:00+09:00'
+     title: 安倍総理大臣「今月20日ごろまで大きなイベントをしないで」
+     futsuurl: https://www3.nhk.or.jp/news/html/20200310/k10012323711000.html
+     sha256: d1186ebbc2013564e52f21a2e8ecd56144ed5fe98c365f6edbd4eefb2db345eb
+   https://www3.nhk.or.jp/news/easy/k10012321401000/k10012321401000.html:
+     url: https://www3.nhk.or.jp/news/easy/k10012321401000/k10012321401000.html
+     scraped: true
+     datetime: '2020-03-11T11:30:00+09:00'
+     title: 島根県の会社 中国から技能実習生が来なくて困っている
+     futsuurl: https://www3.nhk.or.jp/news/html/20200309/k10012321401000.html
+     sha256: 2df91884fbbafdc69bc3126cb0cb7b63b2c24e85bc0de707643919e4581927a9
+ ```
+
+ If you don't wish to edit this file by hand (or programmatically), that's where the `search` command comes into play.
+
+ #### News Command [^](#contents)
+
+ ## Using the Library [^](#contents)
+
+ ### Setup
+
+ Pick your poison...
+
+ In your *Gemspec* (*&lt;project&gt;.gemspec*):
+
+ ```Ruby
+ spec.add_runtime_dependency 'nhkore', '~> X.X'
+ ```
+
+ In your *Gemfile*:
+
+ ```Ruby
+ # Pick one...
+ gem 'nhkore', '~> X.X'
+ gem 'nhkore', :git => 'https://github.com/esotericpig/nhkore.git', :tag => 'vX.X'
+ ```
+
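As a quick smoke test after setup, a minimal sketch (it only assumes the top-level `nhkore` require and the `NHKore::VERSION` constant, which the Rakefile below also references):

```Ruby
# Hypothetical example: confirm the gem is installed and loads.
require 'nhkore'
require 'nhkore/version'

puts NHKore::VERSION
```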
+ ### Scraper
+
+ ## Hacking [^](#contents)

  ```
  $ git clone 'https://github.com/esotericpig/nhkore.git'
@@ -48,19 +320,35 @@ $ bundle install
  $ bundle exec rake -T
  ```

- ### Testing
+ Install Nokogiri:

  ```
- $ bundle exec rake test
+ $ bundle exec rake nokogiri_apt # Ubuntu/Debian
+ $ bundle exec rake nokogiri_dnf # Fedora/CentOS/Red Hat
+ $ bundle exec rake nokogiri_other # macOS, Windows, etc.
  ```

+ ### Running
+
+ `$ ruby -w lib/nhkore.rb`
+
+ ### Testing
+
+ `$ bundle exec rake test`
+
  ### Generating Doc

- ```
- $ bundle exec rake doc
- ```
+ `$ bundle exec rake doc`
+
+ ### Installing Locally (without Network Access)
+
+ `$ bundle exec rake install:local`
+
+ ### Releasing/Publishing
+
+ `$ bundle exec rake release`

- ## [License](#contents)
+ ## License [^](#contents)

  [GNU LGPL v3+](LICENSE.txt)

data/Rakefile CHANGED
@@ -26,14 +26,13 @@ require 'rake/clean'
  require 'rake/testtask'
  require 'raketeer/irb'
  require 'raketeer/nokogiri_installs'
- require 'raketeer/run'
  require 'yard'
  require 'yard_ghurt'

+ require 'nhkore/util'
  require 'nhkore/version'


- CORE_PKG_DIR = 'core_pkg'
  PKG_DIR = 'pkg'

  CLEAN.exclude('.git/','stock/')
@@ -46,16 +45,14 @@ desc 'Generate documentation (YARDoc)'
  task :doc => [:yard,:yard_gfm_fix] do |task|
  end

- desc "Package '#{File.join(CORE_PKG_DIR,'')}' data as a Zip file into '#{File.join(PKG_DIR,'')}'"
+ desc "Package '#{File.join(NHKore::Util::CORE_DIR,'')}' data as a Zip file into '#{File.join(PKG_DIR,'')}'"
  task :pkg_core do |task|
    mkdir_p PKG_DIR

-   cd CORE_PKG_DIR do
-     pattern = File.join('core','*.{csv,html,yml}')
-     zip_file = File.join('..',PKG_DIR,'nhkore-core.zip')
-
-     sh 'zip','-9rv',zip_file,*Dir.glob(pattern).sort()
-   end
+   pattern = File.join(NHKore::Util::CORE_DIR,'*.{csv,html,yml}')
+   zip_file = File.join(PKG_DIR,'nhkore-core.zip')
+
+   sh 'zip','-9rv',zip_file,*Dir.glob(pattern).sort()
  end

  Rake::TestTask.new() do |task|
@@ -77,8 +74,8 @@ YARD::Rake::YardocTask.new() do |task|
    task.options += ['--title',"NHKore v#{NHKore::VERSION} Doc"]
  end

- # Execute "yard_gfm_fix" for production.
- # Execute "yard_gfm_fix[true]" for testing locally.
+ # Execute "rake yard_gfm_fix" for production.
+ # Execute "rake yard_gfm_fix[true]" for testing locally.
  YardGhurt::GFMFixTask.new() do |task|
    task.arg_names = [:dev]
    task.dry_run = false
@@ -86,10 +83,10 @@ YardGhurt::GFMFixTask.new() do |task|
    task.md_files = ['index.html']

    task.before = Proc.new() do |task,args|
-     # Delete this file as it's never used (index.html is an exact copy)
+     # Delete this file as it's never used (index.html is an exact copy).
      YardGhurt::Util.rm_exist(File.join(task.doc_dir,'file.README.html'))

-     # Root dir of my GitHub Page for CSS/JS
+     # Root dir of my GitHub Page for CSS/JS.
      GHP_ROOT = YardGhurt::Util.to_bool(args.dev) ? '../../esotericpig.github.io' : '../../..'

      task.css_styles << %Q(<link rel="stylesheet" type="text/css" href="#{GHP_ROOT}/css/prism.css" />)