bookshark 1.0.0.pre.2 → 1.0.1

Sign up to get free protection for your applications and to get access to all the features.
Files changed (78) hide show
  1. checksums.yaml +4 -4
  2. data/README.md +30 -30
  3. data/bookshark.gemspec +1 -0
  4. data/lib/bookshark.rb +35 -7
  5. data/lib/bookshark/crawlers/base.rb +11 -9
  6. data/lib/bookshark/crawlers/bibliographical_record_crawler.rb +43 -0
  7. data/lib/bookshark/crawlers/book_crawler.rb +42 -54
  8. data/lib/bookshark/extractors/bibliographical_book_extractor.rb +57 -35
  9. data/lib/bookshark/extractors/book_extractor.rb +9 -1
  10. data/lib/bookshark/storage/json_book_records/0/book_63.json +61 -0
  11. data/lib/bookshark/storage/json_book_records/0/book_67.json +53 -0
  12. data/lib/bookshark/storage/json_book_records/2/book_2110.json +59 -0
  13. data/lib/bookshark/storage/json_book_records/2/book_2111.json +65 -0
  14. data/lib/bookshark/storage/json_book_records/2/book_2112.json +69 -0
  15. data/lib/bookshark/storage/json_book_records/2/book_2113.json +59 -0
  16. data/lib/bookshark/storage/json_book_records/2/book_2114.json +67 -0
  17. data/lib/bookshark/storage/json_book_records/2/book_2115.json +71 -0
  18. data/lib/bookshark/storage/json_book_records/2/book_2116.json +63 -0
  19. data/lib/bookshark/storage/json_book_records/2/book_2117.json +61 -0
  20. data/lib/bookshark/storage/json_book_records/2/book_2118.json +83 -0
  21. data/lib/bookshark/storage/json_book_records/2/book_2119.json +69 -0
  22. data/lib/bookshark/storage/json_book_records/2/book_2120.json +69 -0
  23. data/lib/bookshark/storage/json_book_records/2/book_2121.json +63 -0
  24. data/lib/bookshark/storage/json_book_records/2/book_2122.json +72 -0
  25. data/lib/bookshark/storage/json_book_records/2/book_2123.json +67 -0
  26. data/lib/bookshark/storage/json_book_records/2/book_2124.json +72 -0
  27. data/lib/bookshark/storage/json_book_records/2/book_2125.json +67 -0
  28. data/lib/bookshark/storage/json_book_records/2/book_2126.json +72 -0
  29. data/lib/bookshark/storage/json_book_records/2/book_2127.json +61 -0
  30. data/lib/bookshark/storage/json_book_records/2/book_2128.json +61 -0
  31. data/lib/bookshark/storage/json_book_records/2/book_2129.json +61 -0
  32. data/lib/bookshark/storage/json_book_records/2/book_2130.json +72 -0
  33. data/lib/bookshark/storage/json_book_records/2/book_2131.json +55 -0
  34. data/lib/bookshark/storage/json_book_records/2/book_2132.json +61 -0
  35. data/lib/bookshark/storage/json_book_records/2/book_2133.json +61 -0
  36. data/lib/bookshark/storage/json_book_records/2/book_2134.json +61 -0
  37. data/lib/bookshark/storage/json_book_records/2/book_2135.json +55 -0
  38. data/lib/bookshark/storage/json_book_records/2/book_2136.json +67 -0
  39. data/lib/bookshark/storage/json_book_records/2/book_2137.json +67 -0
  40. data/lib/bookshark/storage/json_book_records/2/book_2138.json +57 -0
  41. data/lib/bookshark/storage/json_book_records/2/book_2139.json +67 -0
  42. data/lib/bookshark/storage/json_book_records/2/book_2140.json +53 -0
  43. data/lib/bookshark/storage/json_book_records/2/book_2141.json +61 -0
  44. data/lib/bookshark/storage/json_book_records/2/book_2142.json +67 -0
  45. data/lib/bookshark/storage/json_book_records/2/book_2143.json +65 -0
  46. data/lib/bookshark/storage/json_book_records/2/book_2144.json +64 -0
  47. data/lib/bookshark/storage/json_book_records/2/book_2145.json +53 -0
  48. data/lib/bookshark/storage/json_book_records/2/book_2146.json +70 -0
  49. data/lib/bookshark/storage/json_book_records/2/book_2147.json +67 -0
  50. data/lib/bookshark/storage/json_book_records/2/book_2148.json +66 -0
  51. data/lib/bookshark/storage/json_book_records/2/book_2149.json +72 -0
  52. data/lib/bookshark/storage/json_book_records/2/book_2150.json +53 -0
  53. data/lib/bookshark/storage/json_book_records/2/book_2151.json +67 -0
  54. data/lib/bookshark/storage/json_book_records/2/book_2152.json +67 -0
  55. data/lib/bookshark/storage/json_book_records/2/book_2153.json +67 -0
  56. data/lib/bookshark/storage/json_book_records/2/book_2154.json +67 -0
  57. data/lib/bookshark/storage/json_book_records/2/book_2155.json +67 -0
  58. data/lib/bookshark/storage/json_book_records/2/book_2156.json +76 -0
  59. data/lib/bookshark/storage/json_book_records/2/book_2157.json +65 -0
  60. data/lib/bookshark/storage/json_book_records/2/book_2158.json +77 -0
  61. data/lib/bookshark/storage/json_book_records/2/book_2159.json +76 -0
  62. data/lib/bookshark/storage/json_book_records/2/book_2160.json +67 -0
  63. data/lib/bookshark/storage/json_book_records/2/book_2161.json +61 -0
  64. data/lib/bookshark/storage/json_book_records/2/book_2162.json +65 -0
  65. data/lib/bookshark/storage/json_book_records/2/book_2163.json +68 -0
  66. data/lib/bookshark/storage/json_book_records/2/book_2164.json +59 -0
  67. data/lib/bookshark/storage/json_book_records/2/book_2165.json +59 -0
  68. data/lib/bookshark/storage/json_book_records/2/book_2166.json +53 -0
  69. data/lib/bookshark/storage/json_book_records/2/book_2167.json +53 -0
  70. data/lib/bookshark/storage/json_book_records/2/book_2168.json +53 -0
  71. data/lib/bookshark/storage/json_book_records/2/book_2169.json +53 -0
  72. data/lib/bookshark/storage/json_book_records/2/book_2170.json +53 -0
  73. data/lib/bookshark/version.rb +1 -1
  74. data/spec/bookshark_spec.rb +62 -46
  75. data/spec/spec_helper.rb +2 -1
  76. data/spec/test_data/bg_record_103788.html +1 -0
  77. data/spec/test_data/book_103788.html +1 -0
  78. metadata +88 -5
checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA1:
3
- metadata.gz: 55cf891db20054fa8dee78ebb422af2a2d2cb25c
4
- data.tar.gz: f5c777f653ab0a52ba7e3d071d89f0bd62763954
3
+ metadata.gz: d35fee946c6b6dcf4ca740d89ba3a9cb89f36a94
4
+ data.tar.gz: ff928cdadd16b132adc9f193ff5c7f565a0b0398
5
5
  SHA512:
6
- metadata.gz: 83e1699c160bd1c578c1335e8b6a4e490598e3588da23c29549d47f7f2860f4d3995c326f583e582d115c06ba8c57a8df224f5a3d77ea3b4f45db95ebe2e91a9
7
- data.tar.gz: 0364e7d5cd6c6f6a01863c1f29bd8a0a6e14f5a2cf44bb66ecee76f86f136ca2da479573e3d64b457939f1f7b2a8bd6ae6f9652818d5ea1984d6b2d4a3f3e42a
6
+ metadata.gz: 2dac9ad4842172d896a488fa60baaf13d862c8d2e3b1e68b3a9f5e6ee28bec5136d4ed7d084c4bf6fa0f5f1cfd3224241958467c7df99929ea8cfc1c5ad92abc
7
+ data.tar.gz: e8f9dcb4f20e0a2330a91588c6dfbb306a74820ee9ed7fa7564097013c16244de46ffc3a636a751126f19ce9c1a32b209fb254d7533d901d6f77a00a6b8b5100
data/README.md CHANGED
@@ -13,7 +13,7 @@ The representation of bibliographic metadata in JSON is inspired by [BibJSON](ht
13
13
  Add this line to your application's Gemfile:
14
14
 
15
15
  ```ruby
16
- gem 'bookshark', "~> 1.0.0.pre"
16
+ gem 'bookshark', "~> 1.0"
17
17
  ```
18
18
 
19
19
  And then execute:
@@ -22,7 +22,7 @@ And then execute:
22
22
 
23
23
  Or install it yourself as:
24
24
 
25
- $ gem install bookshark --pre
25
+ $ gem install bookshark
26
26
 
27
27
  Require and include bookshark in your class/module.
28
28
 
@@ -85,7 +85,7 @@ extractor.book(isbn: '9789601411576')
85
85
  # Extract book with id 103788 from website
86
86
  extractor.book(id: 103788)
87
87
 
88
- # Extract book from the provided webpage
88
+ # Extract book from the provided webpage
89
89
  extractor.book(uri: 'http://biblionet.gr/book/103788/')
90
90
 
91
91
  # Extract book with id 103788 from local storage
@@ -93,12 +93,12 @@ extractor.book(id: 103788, local: true)
93
93
  ```
94
94
  For more options, like book's title or author, use the search method which is described below.
95
95
 
96
- **Book Options**
96
+ **Book Options**
97
97
  (Recommended option is to use just the id and let bookshark generate the uri):
98
98
 
99
99
  * id : The id of book on the corresponding site (Integer)
100
100
  * uri : The url of book web page or the path to local file.
101
- * local : Boolean value. Has page been saved locally? (default is false)
101
+ * local : Boolean value. Has page been saved locally? (default is false)
102
102
  * format : The format in which the extracted data are returned
103
103
  * hash (default)
104
104
  * json
@@ -112,9 +112,9 @@ puts Bookshark::Extractor.new(format: 'pretty_json').book(id: 185281)
112
112
 
113
113
  #### Eager Extraction
114
114
 
115
- Each book has some attributes such as authors, contributors, categories etc which are actually references to other objects.
116
- By default when extracting a book, you get only names of these objects and references to their pages.
117
- With eager option set to true, each of these objects' data is extracted and the produced output contains complete information about every object.
115
+ Each book has some attributes such as authors, contributors, categories etc which are actually references to other objects.
116
+ By default when extracting a book, you get only names of these objects and references to their pages.
117
+ With eager option set to true, each of these objects' data is extracted and the produced output contains complete information about every object.
118
118
  Eager extraction doesn't work with local option enabled.
119
119
 
120
120
  ```ruby
@@ -215,24 +215,24 @@ extractor.search(title: 'αρχοντας', author: 'τολκιν', results_type
215
215
  ```
216
216
  Searching and extracting several books can be very slow at times, so instead of extracting every single book you may prefer only the ids of found books. In that case pass the option `results_type: 'ids'`.
217
217
 
218
- **Search Options**:
218
+ **Search Options**:
219
219
  With enough options you can customize your query to your needs. It is recommended to use at least two of the search options.
220
220
 
221
- * title (The title of book to search)
222
- * author (The author's last name is enough for filter the search)
221
+ * title (The title of book to search)
222
 + * author (The author's last name is enough to filter the search)
223
223
  * publisher
224
224
  * category
225
225
  * title_split
226
226
  * 0 (The exact title phrase must be matched)
227
- * 1 (Default - All the words in title must be matched in whatever order)
227
+ * 1 (Default - All the words in title must be matched in whatever order)
228
228
  * 2 (At least one word should match)
229
- * book_id (Providing id means only one book should returned)
230
- * isbn
231
- * author_id (ID of the selected author)
232
- * publisher_id
233
- * category_id
234
- * after_year (Published this year or later)
235
- * before_year (Published this year or before)
229
 + * book_id (Providing id means only one book should be returned)
230
+ * isbn
231
+ * author_id (ID of the selected author)
232
+ * publisher_id
233
+ * category_id
234
+ * after_year (Published this year or later)
235
+ * before_year (Published this year or before)
236
236
  * results_type
237
237
  * metadata (Default - Every book is extracted and an array of metadata is returned)
238
238
  * ids (Only ids are returned)
@@ -243,7 +243,7 @@ With enought options you can customize your query to your needs. It is recommend
243
243
 
244
244
  Results with ids option look like that:
245
245
 
246
- ```json
246
+ ```json
247
247
  {
248
248
  "book": [
249
249
  "119000",
@@ -271,7 +271,7 @@ Normally results are multiple books like the ones in book extractors:
271
271
  {
272
272
  "title": "Σημεία και τέρατα της οικονομίας",
273
273
  "subtitle": "Η κρυφή πλευρά των πάντων",
274
- "... Rest of Metadata ...": "... condensed ..."
274
+ "... Rest of Metadata ...": "... condensed ..."
275
275
  },
276
276
  {
277
277
  "title": "Και άλλα σημεία και τέρατα από την ιστορία",
@@ -281,7 +281,7 @@ Normally results are multiple books like the ones in book extractors:
281
281
  {
282
282
  "title": "Σημεία και τέρατα από την ιστορία",
283
283
  "subtitle": null,
284
- "... Rest of Metadata ...": "... condensed ..."
284
+ "... Rest of Metadata ...": "... condensed ..."
285
285
  }
286
286
  ]
287
287
  }
@@ -304,7 +304,7 @@ extractor.author(uri: 'storage/html_author_pages/2/author_2423.html', local: tru
304
304
  **Author Options**: (Recommended option is to use just the id and let bookshark generate the uri):
305
305
  * id : The id of author on the corresponding site (Integer)
306
306
  * uri : The url of author web page or the path to local file.
307
- * local : Boolean value. Has page been saved locally? (default is false)
307
+ * local : Boolean value. Has page been saved locally? (default is false)
308
308
 
309
309
  The expected result of an author extraction is something like this:
310
310
 
@@ -329,7 +329,7 @@ The expected result of an author extraction is something like this:
329
329
  ]
330
330
  }
331
331
  ```
332
- The convention here is that there is never just a single author, but instead the author hash is stored inside an array.
332
+ The convention here is that there is never just a single author, but instead the author hash is stored inside an array.
333
333
  So, it is easy to include metadata for multiple authors or even for multiple types of entities such as publishers or books on the same json file.
334
334
 
335
335
  ### Extract Publisher Data
@@ -342,7 +342,7 @@ extractor = Extractor.new(format: 'pretty_json')
342
342
  # Extract publisher with id 20 from website
343
343
  extractor.publisher(id: 20)
344
344
 
345
- # Extract publisher from the provided webpage
345
+ # Extract publisher from the provided webpage
346
346
  extractor.publisher(uri: 'http://biblionet.gr/com/20/')
347
347
 
348
348
  # Extract publisher with id 20 from local storage
@@ -352,7 +352,7 @@ extractor.publisher(id: 20, local: true)
352
352
 
353
353
  * id : The id of publisher on the corresponding site (Integer)
354
354
  * uri : The url of publisher web page or the path to local file.
355
- * local : Boolean value. Has page been saved locally? (default is false)
355
+ * local : Boolean value. Has page been saved locally? (default is false)
356
356
  * format : The format in which the extracted data are returned
357
357
  * hash (default)
358
358
  * json
@@ -397,7 +397,7 @@ The expected result of an author extraction is something like this:
397
397
  ],
398
398
  "fax": "210 3650069",
399
399
  "email": "info@patakis.gr",
400
- "website": "www.patakis.gr"
400
+ "website": "www.patakis.gr"
401
401
  }
402
402
  },
403
403
  "b_id": "20"
@@ -415,7 +415,7 @@ extractor = Extractor.new(format: 'pretty_json')
415
415
  # Extract category with id 1041 from website
416
416
  extractor.category(id: 1041)
417
417
 
418
- # Extract category from the provided webpage
418
+ # Extract category from the provided webpage
419
419
  extractor.category(uri: 'http://biblionet.gr/index/1041/')
420
420
 
421
421
  # Extract category with id 1041 from local storage
@@ -425,7 +425,7 @@ extractor.category(id: 1041, local: true)
425
425
 
426
426
  * id : The id of category on the corresponding site (Integer)
427
427
  * uri : The url of category web page or the path to local file.
428
- * local : Boolean value. Has page been saved locally? (default is false)
428
+ * local : Boolean value. Has page been saved locally? (default is false)
429
429
  * format : The format in which the extracted data are returned
430
430
  * hash (default)
431
431
  * json
@@ -490,7 +490,7 @@ Take a look at this table:
490
490
  |---------|:-----------:|----------------------------------|
491
491
  | 103788 | book | http://biblionet.gr/book/103788 |
492
492
  | 10207 | author | http://biblionet.gr/author/10207 |
493
- | 20 | publisher | http://biblionet.gr/com/20 |
493
+ | 20 | publisher | http://biblionet.gr/com/20 |
494
494
  | 1041 | category | http://biblionet.gr/index/1041 |
495
495
 
496
496
  So if you want to use the uri option provide the target webpage's url as seen above without any slugs after the id.
@@ -28,4 +28,5 @@ Gem::Specification.new do |spec|
28
28
  spec.add_development_dependency "bundler", ">= 1.6"
29
29
  spec.add_development_dependency "rake", "~> 10.0"
30
30
  spec.add_development_dependency 'rspec', "~> 3.2"
31
+ spec.add_development_dependency "webmock", "~> 1.2"
31
32
  end
@@ -10,6 +10,8 @@ require 'bookshark/extractors/search'
10
10
 
11
11
  require 'bookshark/crawlers/base'
12
12
  require 'bookshark/crawlers/publisher_crawler'
13
+ require 'bookshark/crawlers/book_crawler'
14
+ require 'bookshark/crawlers/bibliographical_record_crawler'
13
15
 
14
16
  module Bookshark
15
17
  DEFAULTS ||= {
@@ -76,7 +78,8 @@ module Bookshark
76
78
 
77
79
  uri = process_options(options, __method__)
78
80
  options[:format] ||= @format
79
- options[:eager] ||= false
81
+ options[:eager] ||= false
82
+ options[:nilify] ||= false
80
83
 
81
84
  if options[:eager]
82
85
  book = eager_extract_book(uri)
@@ -86,8 +89,12 @@ module Bookshark
86
89
 
87
90
  response = {}
88
91
  response[:book] = !book.nil? ? [book] : []
92
+
93
+ return nil if response[:book].empty? and options[:nilify]
94
+
89
95
  response = change_format(response, options[:format])
90
- response = book_extractor.decode_text(response)
96
+
97
+ response = book_extractor.decode_text(response) if response.class == "String"
91
98
 
92
99
  return response
93
100
  end
@@ -137,9 +144,9 @@ module Bookshark
137
144
  return response
138
145
  end
139
146
 
140
- def books_from_storage
141
- extract_from_storage_and_save('book', 'html_book_pages', 'json_book_pages')
142
- end
147
+ # def books_from_storage
148
+ # extract_from_storage_and_save('book', 'html_book_pages', 'json_book_pages')
149
+ # end
143
150
 
144
151
  def authors_from_storage
145
152
  extract_from_storage_and_save('author', 'html_author_pages', 'json_author_pages')
@@ -153,6 +160,17 @@ module Bookshark
153
160
  extract_from_storage_and_save('category', 'html_category_pages', 'json_category_pages')
154
161
  end
155
162
 
163
+ def extract_books_from_storage_and_save(start_id, finish_id, format = 'pretty_json')
164
+ start_id.upto(finish_id) do |book_id|
165
+ record = book(id: book_id, local: true, format: format, nilify: true)
166
+
167
+ dir_to_save = Bookshark.path_to_storage + '/' + 'json_book_records/' + "#{((book_id-1)/1000)}/" + "book_#{book_id}.json"
168
+
169
+ save_to(dir_to_save, record) unless record.nil?
170
+ end
171
+ end
172
+
173
+
156
174
  def extract_from_storage_and_save(metadata_type, source_dir, target_dir)
157
175
  list_directories(path: Bookshark.path_to_storage + '/' + source_dir).each do |dir|
158
176
  dir_to_save = dir.gsub(source_dir, target_dir)
@@ -168,8 +186,8 @@ module Bookshark
168
186
  record = author(options)
169
187
  when 'publisher'
170
188
  record = publisher(options)
171
- when 'book'
172
- record = book(options)
189
+ # when 'book'
190
+ # record = book(options)
173
191
  when 'category'
174
192
  record = category(options)
175
193
  end
@@ -334,6 +352,16 @@ module Bookshark
334
352
  crawler.crawl_and_save
335
353
  end
336
354
 
355
+ def books(options = {})
356
+ crawler = Biblionet::Crawlers::BookCrawler.new(options)
357
+ crawler.crawl_and_save
358
+ end
359
+
360
+ def bibliographical_records(options = {})
361
+ crawler = Biblionet::Crawlers::BibliographicalRecordCrawler.new(options)
362
+ crawler.crawl_and_save
363
+ end
364
+
337
365
  end
338
366
 
339
367
  # module Biblionet
@@ -5,13 +5,14 @@ module Biblionet
5
5
 
6
6
  class Base
7
7
  def initialize(options = {})
8
- @folder = options[:folder] ||= 'lib/bookshark/storage/html_base_pages'
9
- @base_url = options[:base_url] ||= 'http://www.biblionet.gr/base/'
10
- @page_type = options[:page_type] ||= 'base'
11
- @extension = options[:extension] ||= '.html'
12
- @start = options[:start] ||= 1
13
- @finish = options[:finish] ||= 10000
14
- @step = options[:step] ||= 1000
8
+ @folder = options[:folder] ||= 'lib/bookshark/storage/html_base_pages'
9
+ @base_url = options[:base_url] ||= 'http://www.biblionet.gr/base/'
10
+ @page_type = options[:page_type] ||= 'base'
11
+ @extension = options[:extension] ||= '.html'
12
+ @save_only_content = options[:save_only_content] ||= false
13
+ @start = options[:start] ||= 1
14
+ @finish = options[:finish] ||= 10000
15
+ @step = options[:step] ||= 1000
15
16
  end
16
17
 
17
18
  def spider
@@ -20,7 +21,8 @@ module Biblionet
20
21
 
21
22
  start.step(finish, @step) do |last|
22
23
  first = last - @step + 1
23
- subfolder = (last/@step - 1).to_s
24
+ subfolder = (last/@step - 1).to_s
25
+ slash = (@page_type != 'bg_record') ? '/' : ''
24
26
  path = "#{@folder}/#{subfolder}/"
25
27
 
26
28
  # Create a new directory (does nothing if directory exists)
@@ -28,7 +30,7 @@ module Biblionet
28
30
 
29
31
  first.upto(last) do |id|
30
32
  file_to_save = "#{path}#{@page_type}_#{id}#{@extension}"
31
- url_to_download = "#{@base_url}#{id}/"
33
+ url_to_download = "#{@base_url}#{id}#{slash}"
32
34
 
33
35
  yield(url_to_download, file_to_save)
34
36
  # downloader = Biblionet::Core::Base.new(url_to_download)
@@ -0,0 +1,43 @@
1
+ require_relative 'base'
2
+
3
+ module Biblionet
4
+ module Crawlers
5
+
6
+ class BibliographicalRecordCrawler < Base
7
+ def initialize(options = {})
8
+ options[:folder] ||= 'lib/bookshark/storage/html_book_pages'
9
+ options[:base_url] ||= 'http://www.biblionet.gr/main.asp?page=results&Titlesid='
10
+ options[:page_type] ||= 'bg_record'
11
+ options[:extension] ||= '.html'
12
+ options[:save_only_content] ||= true
13
+ options[:start] ||= 176001
14
+ options[:finish] ||= 180000
15
+ options[:step] ||= 1000
16
+ super(options)
17
+ end
18
+
19
+ def crawl_and_save
20
+ downloader = Extractors::Base.new
21
+
22
+ spider do |url_to_download, file_to_save|
23
+ downloader.load_page(url_to_download)
24
+
25
+ # Create a new directory (does nothing if directory exists)
26
+ path = File.dirname(file_to_save)
27
+ FileUtils.mkdir_p path unless File.directory?(path)
28
+
29
+ # No need to download the whole page. Just the part containing the book.
30
+ if @save_only_content
31
+ content_re = /<!-- CONTENT START -->.*<!-- CONTENT END -->/m
32
+ content = content_re.match(downloader.page)[0] unless (content_re.match(downloader.page)).nil?
33
+ downloader.save_to(file_to_save, content) unless downloader.page.nil? or downloader.page.length < 1024
34
+ else
35
+ downloader.save_page(file_to_save) unless downloader.page.nil? or downloader.page.length < 1024
36
+ end
37
+
38
+ end
39
+ end
40
+ end
41
+
42
+ end
43
+ end
@@ -1,55 +1,43 @@
1
- require 'rubygems'
2
- require 'nokogiri'
3
- require 'open-uri'
4
- require 'fileutils'
5
-
6
- require File.expand_path(File.join(File.dirname(__FILE__), '../extractors', 'base'))
7
- # page = Nokogiri::HTML(open("raw_html_pages/book_45454.html"))
8
- # puts page.class # => Nokogiri::HTML::Document
9
- # puts page
10
-
11
- FOLDER = 'html_book_pages'
12
- BASE_URL = 'http://www.biblionet.gr/book/'
13
- EXTENSION = '.html'
14
-
15
- 301000.step(400000, 1000) do |last|
16
- # saved_pages = 0
17
- # empty_pages = 0
18
-
19
- first = last - 1000 + 1
20
- subfolder = (last/1000 - 1).to_s
21
- path = "#{FOLDER}/#{subfolder}/"
22
-
23
- # Create a new directory (does nothing if directory exists)
24
- FileUtils.mkdir_p path
25
-
26
- first.upto(last) do |id|
27
- file_to_save = "#{path}book_#{id}#{EXTENSION}"
28
- url_to_download = "#{BASE_URL}#{id}/"
29
-
30
- downloader = Biblionet::Core::Base.new(url_to_download)
31
- downloader.save_page(file_to_save) unless downloader.page.nil?
32
-
33
- # open(url_to_parse) do |uri|
34
- # puts "Parsing page: #{url_to_parse}"
35
- # page = uri.read.gsub(/\s+/, " ")
36
- # # doc = Nokogiri::HTML(page)
37
- # # body = doc.at('title').inner_html
38
- # # puts body
39
- # if page.include? "</body>"
40
- # puts "Saving page: #{file_to_save}"
41
- # open(file_to_save, "w") do |file|
42
- # file.write(page)
43
- # end
44
- # saved_pages += 1
45
- # else
46
- # puts "Page #{file_to_save} seems to be empty..."
47
- # empty_pages += 1
48
- # end
49
- # end
50
- end
51
-
52
- # puts "Saved Pages: #{saved_pages}"
53
- # puts "Empty Pages: #{empty_pages}"
1
+ require_relative 'base'
2
+
3
+ module Biblionet
4
+ module Crawlers
5
+
6
+ class BookCrawler < Base
7
+ def initialize(options = {})
8
+ options[:folder] ||= 'lib/bookshark/storage/html_book_pages'
9
+ options[:base_url] ||= 'http://www.biblionet.gr/book/'
10
+ options[:page_type] ||= 'book'
11
+ options[:extension] ||= '.html'
12
+ options[:save_only_content] ||= true
13
+ options[:start] ||= 1
14
+ options[:finish] ||= 10000
15
+ options[:step] ||= 1000
16
+ super(options)
17
+ end
18
+
19
+ def crawl_and_save
20
+ downloader = Extractors::Base.new
21
+
22
+ spider do |url_to_download, file_to_save|
23
+ downloader.load_page(url_to_download)
24
+
25
+ # Create a new directory (does nothing if directory exists)
26
+ path = File.dirname(file_to_save)
27
+ FileUtils.mkdir_p path unless File.directory?(path)
28
+
29
+ # No need to download the whole page. Just the part containing the book.
30
+ if @save_only_content
31
+ content_re = /<!-- CONTENT START -->.*<!-- CONTENT END -->/m
32
+ content = content_re.match(downloader.page)[0] unless (content_re.match(downloader.page)).nil?
33
+ downloader.save_to(file_to_save, content) unless downloader.page.nil? or downloader.page.length < 1024
34
+ else
35
+ downloader.save_page(file_to_save) unless downloader.page.nil? or downloader.page.length < 1024
36
+ end
37
+ end
38
+ end
39
+
40
+ end
54
41
 
55
- end
42
+ end
43
+ end