RubyGems - wiki-api - Versions diffs - 0.0.2 → 0.1.0 - Mend

wiki-api 0.0.2 → 0.1.0

Files changed (16)

checksums.yaml +8 -8
data/README.md +64 -33
data/lib/wiki/api/connect.rb +21 -7
data/lib/wiki/api/page.rb +23 -50
data/lib/wiki/api/page_block.rb +6 -4
data/lib/wiki/api/page_headline.rb +97 -2
data/lib/wiki/api/page_link.rb +9 -4
data/lib/wiki/api/page_list_item.rb +4 -2
data/lib/wiki/api/util.rb +12 -1
data/lib/wiki/api/version.rb +1 -1
data/test/unit/files/Wiktionary_program.html +4232 -0
data/test/unit/wiki_page_offline.rb +262 -0
data/wiki-api.gemspec +2 -2
metadata +8 -8
data/test/unit/wiki_page_config.rb +0 -45
data/test/unit/wiki_page_object.rb +0 -229

checksums.yaml CHANGED

@@ -1,15 +1,15 @@
 ---
 SHA1:
   metadata.gz: !binary |-
-    NjgxOGUxZjQ2MWQ2MjNhMDA2ZGUwMTRhOGI4MWFlOGQ3MzI4MWFjOA==
+    NjQ3MjZkMDdmNTg2YjdhZDRmM2E3MjU4ZjA1Y2IwOGYzODEwZTFkMA==
   data.tar.gz: !binary |-
-    ZmZkNDFhMzc0ZTNmZDBlYTFmMTIwMmU5ZDgzYTQ2YjM0ZTk1ZmQzYg==
+    YWE4Mzc4ZjRlYTBjNGE4MTkyYmE0OGFkOTJkMDViZTI0MjQ5MGFiMw==
 SHA512:
   metadata.gz: !binary |-
-    NGM4YTU2MjQ3Njk1MzJkMDhlYjcxODYxNDFkNzRlODI5MjMwNmU5ZGEzZmJj
-    MjhjZjYxYzcxMmYzYjA0YzA3NzdlYTJhMjM0ZTllNzgyMDk0MGJiNjBiZWRl
-    N2Y5YzMwZWZjZmY3NWQ0YmJiMjdiOTkwOTU1ZmE4MDg5Njk4M2Y=
+    OTNhMTZkNjMwNzJiMzU5YWE0ZDZiNzRlZWU5ZDJjM2Q1NTA5ZWRiN2IzY2Mw
+    MmU1ZDk0ODZhN2U4ODYwNjY0ZjdmY2U5ZTFkMDk4ZDA2MzIyODUzNjE0YzVl
+    OGE2ZmFmOTYyOWY2MWIyNGNlNmU5NjYwOTNkMGNhNjllOWM0YzQ=
   data.tar.gz: !binary |-
-    MGZlMTYzZTgzZWE3YmYzZmIyMjc0OTZhMGY0NDEwYzJmNmFiMTZkNDM3OGM2
-    Mjc1MDdjMzQ3MjM1NmVlODM3Mzg5ZTViMGRmOGI2NzE1NDZjODJhZTA2MjI5
-    NWE3YmI4MDYxY2I4NGM3MGUwNzAzNjQ3YjMwODU5NDBlMWYxZDM=
+    YjgzZGEzYzhhOWFmNzZhMjRlMWFiYmJiY2Q3N2EwOGQwZTBjY2Q0NzYxNWE2
+    ODc5NmMyNmYyODMyNmVmMjFmYzhhOTAzMTUzZTBmODU2OTMwY2RhYjg0Mjkz
+    Yjk3NjMzNGFlZGViYzQyOGQ5YzVjM2MzMjIyNWVlOWRhOTU0MDk=

data/README.md CHANGED

@@ -1,13 +1,17 @@
 # Wiki::Api
-Wiki API is a gem (Ruby on Rails) that interfaces with the MediaWiki API (https://www.mediawiki.org/wiki/API:Main_page). This gem is more than a interface, it has abstract classes like: Page on which you can request page parameters (like headlines, and text blocks within headlines).
+Wiki API is a gem (Ruby on Rails) that interfaces with the MediaWiki API (https://www.mediawiki.org/wiki/API:Main_page). This gem is more than a interface, it has abstract classes for Page and Headline parsing. You're able to iterate through these headlines, and access data accordingly.
-NOTE: nokogiri is used for background parsing of HTML. Because I believe there is no point of wrapping internals (composing) for this purpose, nokogiri nodes elements etc. are exposed (http://nokogiri.org/Nokogiri.html) through the wiki-api.
+NOTE: This gem has a nokogiri (http://nokogiri.org/Nokogiri.html) backend (for HTML parsing). Major components: Page, Headline, Block, ListItem, and Link are wrappers for easy data access; however, it's still possible to retrieve the raw HTML within these objects.
 Requests to the MediaWiki API use the following URI structure:
     http(s)://somemediawiki.org/w/api.php?action=parse&format=json&page="anypage"
+# RDoc (rdoc.info)
+    http://rdoc.info/github/dblommesteijn/wiki-api/frames/file/README.md
 ### Dependencies (production)
@@ -15,27 +19,27 @@ Requests to the MediaWiki API use the following URI structure:
 * nokogiri
-### Roadmap
+### Feature Roadmap
-* Version (0.0.2) (current)
+* Version (0.1.0)
-  Index important words per block, page, list item;
+  Major current release with several core changes.
-  Parse objects for more elements within a Page.
+* Version (0.1.1)
+  No features determined yet (please drop me a line if you're interested in additions).
 ### Changelog
-* Version (0.0.1) -> (0.0.2)
-  Nested ListItems, Links (within Page)
+* Version (0.0.2) -> (current)
-  Search on Page headline (ignore case, and underscore)
+  PageLink URI without global config Exception resolved
+  Reverse (parent) object lookup
-### Known Issues
+  Nested PageHeadline objects
-None discovered thus far.
 ## Installation
@@ -71,13 +75,16 @@ Wiki::Api::Connect.config = CONFIG
 ## Usage
-### Query a Page
+### Query a Page and Headline
 Requesting headlines from a given page.
 ```ruby
 page = Wiki::Api::Page.new name: "Wiktionary:Welcome,_newcomers"
-page.headlines.each do |headline|
+# the root headline equals the pagename
+puts page.root_headline.name
+# iterate next level of headlines
+page.root_headline.headlines.each do |headline_name, headline|
   # printing headline name (PageHeadline)
   puts headline.name
 end
@@ -87,29 +94,28 @@ Getting headlines for a given name.
 ```ruby
 page = Wiki::Api::Page.new name: "Wiktionary:Welcome,_newcomers"
-page.headline("Wiktionary:Welcome,_newcomers").each do |headline|
-  # printing headline name (PageHeadline)
-  puts headline.name
-end
+# lookup headline by name (underscore and case are ignored)
+headline = page.root_headline.headline("editing wiktionary").first
+# printing headline name (PageHeadline)
+puts headline.name
+# get the type of nested headline (html h1,2,3,4 etc.)
+puts headline.type
 ```
 ### Basic Page structure
 ```ruby
 page = Wiki::Api::Page.new name: "Wiktionary:Welcome,_newcomers"
 # iterate PageHeadline objects
-page.headlines.each do |headline|
+page.root_headline.headlines.each do |headline_name, headline|
   # exposing nokogiri internal elements
   elements = headline.elements.flatten
   elements.each do |element|
-    # access Nokogiri::XML::*
+    # print will result in: Nokogiri::XML::Text or Nokogiri::XML::Element
+    puts element.class
   end
   # string representation of all nested text
   headline.block.to_texts
   # iterate PageListItem objects
   headline.block.list_items.each do |list_item|
     # string representation of nested text
@@ -136,7 +142,7 @@ end
 ```
-### Example using Global config (https://en.wikipedia.org/wiki/Ruby_on_rails)
+### Example using Global config (https://en.wikipedia.org/wiki/Ruby_on_Rails)
 This is a example of querying wikipedia.org on the page: "Ruby_on_rails", and printing the References headline links for each list item.
@@ -146,35 +152,32 @@ CONFIG = { uri: "https://en.wikipedia.org" }
 Wiki::Api::Connect.config = CONFIG
 # querying the page
-page = Wiki::Api::Page.new name: "Ruby_on_rails"
+page = Wiki::Api::Page.new name: "Ruby_on_Rails"
 # get headlines with name Reference (there can be multiple headlines with the same name!)
-headlines = page.headline "References"
+headlines = page.root_headline.headline "References"
 # iterate headlines
 headlines.each do |headline|
   # iterate list items on the given headline
   headline.block.list_items.each do |list_item|
     # print the uri of all links
     puts list_item.links.map{ |l| l.uri }
   end
 end
 ```
-### Example passing URI (https://en.wikipedia.org/wiki/Ruby_on_rails)
+### Example passing URI (https://en.wikipedia.org/wiki/Ruby_on_Rails)
 This is the same example as the one above, except for setting a global config to direct the requests to a given URI.
 ```ruby
 # querying the page
-page = Wiki::Api::Page.new name: "Ruby_on_rails", uri: "https://en.wikipedia.org"
+page = Wiki::Api::Page.new name: "Ruby_on_Rails", uri: "https://en.wikipedia.org"
 # get headlines with name Reference (there can be multiple headlines with the same name!)
-headlines = page.headline "References"
+headlines = page.root_headline.headline "References"
 # iterate headlines
 headlines.each do |headline|
@@ -189,4 +192,32 @@ end
 ```
+### Example searching headlines
+This example shows how the headlines can be searched. For more info check: https://github.com/dblommesteijn/wiki-api/blob/master/lib/wiki/api/page.rb#L97
+```ruby
+# querying the page
+page = Wiki::Api::Page.new name: "Ruby_on_Rails", uri: "https://en.wikipedia.org"
+# NOTE: the following are all valid headline names:
+# request headline (by literal name)
+headlines = page.root_headline.headline "Philosophy_and_design"
+puts headlines.map{|h| h.name}
+# request headline (by downcase name)
+headlines = page.root_headline.headline "philosophy_and_design"
+puts headlines.map{|h| h.name}
+# request headline (by human name)
+headlines = page.root_headline.headline "philosophy and design"
+puts headlines.map{|h| h.name}
+# NOTE2: headlines are matched on headline.start_with?(requested_headline)
+# because of start_with? compare this should work as well!
+headlines = page.root_headline.headline "philosophy"
+puts headlines.map{|h| h.name}
+```

data/lib/wiki/api/connect.rb CHANGED

@@ -7,12 +7,13 @@ module Wiki
     class Connect
-      attr_accessor :uri, :api_path, :api_options, :http, :request, :response, :html, :parsed
+      attr_accessor :uri, :api_path, :api_options, :http, :request, :response, :html, :parsed, :file
       def initialize(options={})
         @@config ||= nil
         options.merge! @@config unless @@config.nil?
         self.uri = options[:uri] if options.include? :uri
+        self.file = options[:file] if options.include? :file
         self.api_path = options[:api_path] if options.include? :api_path
         self.api_options = options[:api_options] if options.include? :api_options
@@ -38,12 +39,25 @@ module Wiki
       def page page_name
         self.api_options[:page] = page_name
-        self.connect
-        response = self.response
-        json = JSON.parse response.body, {symbolize_names: true}
-        raise json[:error][:code] unless valid? json, response
-        self.html = json[:parse][:text]
-        self.parsed = Nokogiri::HTML self.html[:*]
+        # parse page by uri
+        if !self.uri.nil? && self.file.nil?
+          self.connect
+          response = self.response
+          json = JSON.parse response.body, {symbolize_names: true}
+          raise json[:error][:code] unless valid? json, response
+          self.html = json[:parse][:text]
+          self.parsed = Nokogiri::HTML self.html[:*]
+        # parse page by file
+        elsif !self.file.nil?
+          f = File.open(self.file)
+          # self.parsed = Nokogiri::HTML self.html[:*]
+          self.parsed = Nokogiri::HTML(f)
+          f.close
+        # invalid config, raise exception
+        else
+          raise "no :uri or :file config found!"
+        end
+        self.parsed
       end
       class << self
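
The branch order added to `Connect#page` above means a `:file` option takes precedence over `:uri`, and supplying neither raises. A minimal sketch of that decision, assuming only what the hunk shows; `select_source` and `read_local_page` are hypothetical helper names, and the HTML parsing step is left out so the sketch needs no gems.

```ruby
# Hypothetical helper mirroring the branch order in Connect#page:
# :uri is used only when no :file is given, :file wins otherwise,
# and a configuration with neither is rejected.
def select_source(uri: nil, file: nil)
  if !uri.nil? && file.nil?
    :uri
  elsif !file.nil?
    :file
  else
    raise ArgumentError, "no :uri or :file config found!"
  end
end

# Hypothetical helper: the block form of File.open closes the handle
# even when reading raises, unlike a separate open/close pair.
def read_local_page(path)
  File.open(path) { |f| f.read }
end

puts select_source(uri: "https://en.wikipedia.org")
puts select_source(uri: "https://en.wikipedia.org", file: "page.html")
```

The block form of `File.open` is a design alternative to the explicit `f.close` in the hunk; it is offered here as a suggestion, not as what the gem does.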

data/lib/wiki/api/page.rb CHANGED

@@ -1,49 +1,34 @@
 module Wiki
   module Api
+    # MediaWiki Page, collection of all html information plus its page title
     class Page
-      attr_accessor :name, :parsed_page, :uri
+      attr_accessor :name, :parsed_page, :uri, :parent
       def initialize(options={})
         self.name = options[:name] if options.include? :name
-        uri = options[:uri] if options.include? :uri
-        @@config ||= nil
-        if @@config.nil? || !uri.nil?
-          # use the connection to collect HTML pages for parsing
-          @connect = Wiki::Api::Connect.new uri: uri
-        else
-          # using a local HTML file for parsing
-        end
+        self.uri = options[:uri] if options.include? :uri
+        @connect = Wiki::Api::Connect.new uri: uri
       end
-      def headlines
-        headlines = []
-        self.parse_blocks.each do |headline_name, elements|
-          headline = PageHeadline.new name: headline_name
-          elements.each do |element|
-            # nokogiri element
-            headline.block << element
-          end
-          headlines << headline
-        end
-        headlines
+      def connect
+        @connect
       end
-      def headline headline_name
-        headlines = []
-        self.parse_blocks(headline_name).each do |headline_name, elements|
-          headline = PageHeadline.new name: headline_name
-          elements.each do |element|
-            # nokogiri element
-            headline.block << element
-          end
-          headlines << headline
-        end
-        headlines
+      # collect all headlines, keep original page formatting
+      def root_headline
+        self.parse_blocks
       end
+      # # collect headlines by given name, this will flatten the nested headlines
+      # def flat_headlines_by_name headline_name
+      #   raise "not yet implemented!"
+      #   # TODO: implement flattening of headlines within the root headline
+      #   # ALT:  breath search option in the root of the first headline
+      #   self.parse_blocks(headline_name)
+      # end
       def to_html
@@ -55,22 +40,8 @@ module Wiki
         self.parse_page = nil
       end
-      class << self
-        def config=(config = {})
-          @@config = config
-        end
-      end
-      protected
       def load_page!
-        if @@config.nil?
-          self.parsed_page ||= @connect.page self.name
-        elsif self.parsed_page.nil?
-          f = File.open(@@config[:file])
-          self.parsed_page = Nokogiri::HTML(f)
-          f.close
-        end
+        self.parsed_page ||= @connect.page self.name
       end
@@ -81,11 +52,12 @@ module Wiki
         # get headline nodes by span class
         xs = self.parsed_page.xpath("//span[@class='mw-headline']")
         # filter single headline by name (ignore case)
         xs = self.filter_headline xs, headline_name unless headline_name.nil?
         # NOTE: first_part has no id attribute and thus cannot be filtered or processed within xpath (xs)
-        if headline_name == self.name || headline_name.nil?
+        if headline_name.nil? || headline_name.start_with?(self.name.downcase)
           x = self.first_part
           result[self.name] ||= []
           result[self.name] << (self.collect_elements(x.parent))
@@ -95,11 +67,12 @@ module Wiki
         xs.each do |x|
           headline = x.attributes["id"].value
           elements = self.collect_elements x.parent.next
-          result[headline] ||= []
+          result[headline] ||= []
           result[headline] << elements
         end
-        result
+        # create root object
+        PageHeadline.new parent: self, name: result.first[0], headlines: result, level: 0
       end
       # harvest first part of the page (missing heading and class="mw-headline")
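
The new `parse_blocks` names the root headline with `result.first[0]`, which relies on Ruby hashes preserving insertion order (guaranteed since Ruby 1.9). A minimal sketch of that behaviour; the keys and the string values are hypothetical placeholders for headline ids and collected Nokogiri elements.

```ruby
# Hypothetical stand-in for the `result` hash built in parse_blocks:
# headline id => list of collected element groups (plain strings here).
result = {}
result["Ruby_on_Rails"] ||= []
result["Ruby_on_Rails"] << ["intro paragraph"]
result["History"] ||= []
result["History"] << ["history paragraph"]

# Hash#first returns the earliest inserted [key, value] pair,
# so index 0 of that pair is the first key that was added.
root_name = result.first[0]
puts root_name
```

If the first part of the page is not inserted first (for example when a headline filter excludes it), the first matching headline becomes the root name instead.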

data/lib/wiki/api/page_block.rb CHANGED

@@ -1,20 +1,22 @@
 module Wiki
   module Api
+    # Collection of elements, segmented per headline
     class PageBlock
-      attr_accessor :elements
+      attr_accessor :elements, :parent
       def initialize options={}
+        self.parent = options[:parent] if options.include? :parent
         self.elements = []
       end
       def << value
+        # value.first.previous.name
         self.elements << value
       end
       def to_texts
-        # TODO: perhaps we should wrap the elements with objects??
         texts = []
         self.elements.flatten.each do |element|
           text = Wiki::Api::Util.element_to_text element if element.is_a? Nokogiri::XML::Element
@@ -28,14 +30,14 @@ module Wiki
       def list_items
         # TODO: perhaps we should wrap the elements with objects, and request a li per element??
         self.search("li").map do |list_item|
-          PageListItem.new element: list_item
+          PageListItem.new parent: self, element: list_item
         end
       end
       def links
         # TODO: perhaps we should wrap the elements with objects, and request a li per element??
         self.search("a").map do |a|
-          PageLink.new element: a
+          PageLink.new parent: self, element: a
         end
       end
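
The `parent:` arguments added above give every wrapper a back-reference, which is what the changelog calls reverse (parent) object lookup. A minimal sketch of walking such a chain, assuming each object exposes a `parent` reader; `ParentNode` and `ancestors` are hypothetical names and not part of the gem.

```ruby
# Hypothetical minimal node; the gem's classes expose `parent` via attr_accessor.
ParentNode = Struct.new(:label, :parent)

# Collect every ancestor, nearest first, by following parent links
# until an object without a parent is reached.
def ancestors(node)
  chain = []
  current = node.parent
  until current.nil?
    chain << current
    current = current.parent
  end
  chain
end

page     = ParentNode.new("Page", nil)
headline = ParentNode.new("PageHeadline", page)
block    = ParentNode.new("PageBlock", headline)
link     = ParentNode.new("PageLink", block)

puts ancestors(link).map(&:label).join(" < ")
```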

data/lib/wiki/api/page_headline.rb CHANGED

@@ -1,20 +1,115 @@
 module Wiki
   module Api
+    # Headline for a page (class="mw-headline")
     class PageHeadline
-      attr_accessor :name, :block
+      require 'json'
+      LEVEL = ["text", "h1", "h2", "h3", "h4", "h5", "h6"]
+      attr_accessor :name, :block, :parent, :headlines, :level
       def initialize options={}
         self.name = options[:name] if options.include? :name
-        self.block = PageBlock.new
+        self.parent = options[:parent] if options.include? :parent
+        self.level = options[:level] if options.include? :level
+        options[:headlines] ||= []
+        self.headlines ||= {}
+        # store elements in a block
+        self.block = PageBlock.new parent: self
+        if options[:headlines].include? self.name
+          options[:headlines][self.name].each do |element|
+            self.block << element
+          end
+        end
+        # collect nested headlines
+        headlines = options[:headlines]
+        # remove self from list
+        headlines.delete self.name
+        nested_headlines = self.nested_headlines headlines, self.name, self.level
+        # iterate nested headlines, and call recursive
+        nested_headlines.each do |headline_name, value|
+          level = LEVEL.index value.first.first.previous.name
+          self.headlines[headline_name] = (PageHeadline.new parent: self, name: headline_name, headlines: headlines, level: level)
+        end
       end
       def elements
         self.block.elements
       end
+      def type
+        self.block.elements.first.first.previous.name
+      end
+      # get headline by name
+      def headline name
+        name = name.downcase.gsub(" ", "_")
+        self.headlines.reject do |k,v|
+          !k.downcase.start_with?(name)
+        end.values()
+      end
+      # recursive headline search
+      # def headline_by_name name, depth = 1
+      #   name = name.downcase.gsub(" ", "_")
+      #   ret = []
+      #   self.headlines.each do |k,v|
+      #     ret << v if k.downcase.start_with?(name)
+      #     next if v.headlines.empty?
+      #     if depth > 0
+      #       q = v.headline_by_name name, (depth - 1)
+      #       ret.concat q
+      #     end
+      #   end
+      #   ret
+      # end
+      # headline exists for current headline
+      def has_headline? name
+        name = name.downcase.gsub(" ", "_")
+        self.headlines.each do |k,v|
+          return true if k.downcase.start_with?(name)
+        end
+        false
+      end
+      def to_hash
+        ret = {name: self.name, headlines: [], type: self.type}
+        self.headlines.each do |headline_name, headline|
+          ret[:headlines] << headline.to_hash
+        end
+        ret
+      end
+      def to_pretty_json
+        JSON.pretty_generate self.to_hash
+      end
+      protected
+      # filter nested headlines (elements) from a parent headline (by name)
+      def nested_headlines headlines, name, original_level
+        ret = {}
+        init_level = nil
+        # iterate headlines, skip already done ones
+        #headlines.drop(headline_index + 1).each do |headline|
+        headlines.to_a.each do |name, value|
+          level = LEVEL.index value.first.first.previous.name
+          init_level ||= level
+          # lower level indicate nest end
+          break if level <= original_level
+          break if level < init_level
+          # higher level indicates nested items, these will be processed recursive
+          next if init_level != level
+          ret[name] = value
+        end
+        ret
+      end
     end
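
The constructor above nests headlines by comparing heading levels (`h2` under the page, `h3` under the preceding `h2`, and so on). The same idea can be sketched with a stack instead of the gem's recursive approach. This is an illustration under stated assumptions: `HeadlineNode` and `build_tree` are hypothetical names, the input is a flat list of `[name, level]` pairs in document order, and every level is at least 1 so the root (level 0) is never popped.

```ruby
# Hypothetical tree node: name, heading level, nested headlines.
HeadlineNode = Struct.new(:name, :level, :children)

# Stack-based nesting: the top of the stack is always the most recent
# headline that is still open for children.
def build_tree(root_name, flat)
  root = HeadlineNode.new(root_name, 0, [])
  stack = [root]
  flat.each do |name, level|
    # Close every headline at the same or a deeper level.
    stack.pop while stack.last.level >= level
    node = HeadlineNode.new(name, level, [])
    stack.last.children << node
    stack << node
  end
  root
end

flat = [["History", 2], ["Early_years", 3], ["Later_years", 3], ["Usage", 2]]
tree = build_tree("Ruby_on_Rails", flat)
tree.children.each do |h|
  puts "#{h.name}: #{h.children.map(&:name).join(', ')}"
end
```

A single pass with a stack visits each headline once, whereas the recursive version in the diff rescans the shared headline hash at every level; both produce a parent/child structure keyed on heading depth.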