RubyGems - bc_crawler - Versions diffs - 0.0.4 - Mend

bc_crawler 0.0.4

Files changed (16)

data/.gitignore ADDED

@@ -0,0 +1,22 @@
+*.gem
+*.rbc
+.bundle
+.config
+.yardoc
+Gemfile.lock
+InstalledFiles
+_yardoc
+coverage
+doc/
+lib/bundler/man
+pkg
+rdoc
+spec/reports
+test/tmp
+test/version_tmp
+tmp
+*.bundle
+*.so
+*.o
+*.a
+mkmf.log

data/Gemfile ADDED

@@ -0,0 +1,4 @@
+source 'https://rubygems.org'
+# Specify your gem's dependencies in bc_crawler.gemspec
+gemspec

data/LICENSE.txt ADDED

@@ -0,0 +1,22 @@
+Copyright (c) 2014 Mario Schuettel
+MIT License
+Permission is hereby granted, free of charge, to any person obtaining
+a copy of this software and associated documentation files (the
+"Software"), to deal in the Software without restriction, including
+without limitation the rights to use, copy, modify, merge, publish,
+distribute, sublicense, and/or sell copies of the Software, and to
+permit persons to whom the Software is furnished to do so, subject to
+the following conditions:
+The above copyright notice and this permission notice shall be
+included in all copies or substantial portions of the Software.
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
+LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
+OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
+WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

data/README.md ADDED

@@ -0,0 +1,252 @@
+# BcCrawler
+A simple Ruby Gem to crawl bandcamp.com sites. It will load information about the artist/label/band, their releases (albums) and all tracks.
+## Installation
+Add this line to your application's Gemfile:
+    gem 'bc_crawler'
+And then execute:
+    $ bundle
+Or install it yourself as:
+    $ gem install bc_crawler
+## Usage
+### Crawl an artist/label/band
+```ruby
+require 'bc_crawler'
+main = BcCrawler::Main.new('https://amandapalmer.bandcamp.com/')
+ => URL: https://amandapalmer.bandcamp.com/
+main.releases.first
+ =>  URL : https://amandapalmer.bandcamp.com//album/an-evening-with-neil-gaiman-and-amanda-palmer
+    Data :
+```
+Initially, the data attribute is empty because only the main page has been crawled.
+### Crawl a release
+```ruby
+main.releases.first.crawl
+main.releases.first
+ =>  URL : https://amandapalmer.bandcamp.com//album/an-evening-with-neil-gaiman-and-amanda-palmer
+    Data : { Hash }
+```
+### Crawl all releases from an artist/label/band at once
+```ruby
+main.crawl
+# Crawling https://amandapalmer.bandcamp.com//album/an-evening-with-neil-gaiman-and-amanda-palmer
+# Crawling https://amandapalmer.bandcamp.com//album/theatre-is-evil-2
+# Crawling https://amandapalmer.bandcamp.com//album/amanda-palmer-goes-down-under
+# Crawling https://amandapalmer.bandcamp.com//album/amanda-palmer-performs-the-popular-hits-of-radiohead-on-her-magical-ukulele
+# Crawling https://amandapalmer.bandcamp.com//album/nighty-night
+# Crawling https://amandapalmer.bandcamp.com//album/who-killed-amanda-palmer
+# Crawling https://amandapalmer.bandcamp.com//album/who-killed-amanda-palmer-alternate-tracks
+# Crawling https://amandapalmer.bandcamp.com//album/map-of-tasmania-the-remix-project
+# Crawling https://amandapalmer.bandcamp.com//album/7-series-part-3
+```
+Certain information about releases and tracks can be accessed directly through attributes.
+### Release information
+```ruby
+release = main.releases.first
+release.artist
+ => "Neil Gaiman and Amanda Palmer"
+release.band_id
+ => 3463798201
+release.type
+ => "album"
+release.title
+ => "An Evening With Neil Gaiman and Amanda Palmer"
+release.id              # "Release ID"
+ => 3510389344
+release.release_date
+ => "19 Nov 2013 00:00:00 GMT"
+release.featured_track_id
+ => 658956410
+release.about
+ => nil
+release.credits
+ => nil
+release.art_fullsize_url
+ => "https://f1.bcbits.com/img/a3489132960_10.jpg"
+release.art_thumb_url
+ => "https://f1.bcbits.com/img/a3489132960_3.jpg"
+release.art_id
+ => nil
+release.has_audio
+ => true
+release.purchase_url
+ => nil
+```
+A release holds one or more tracks in an array. Each track has these attributes:
+### Track information
+```ruby
+random_track = release.tracks.sample   # Array#sample never indexes past the last track
+random_track.id         # "Track ID"
+ => 658956410
+random_track.track_num
+ => 32
+random_track.title
+ => "Judy Blume"
+random_track.duration
+ => 395.093
+random_track.url
+ => "https://amandapalmer.bandcamp.com//track/judy-blume-2"
+random_track.is_downloadable
+ => true
+random_track.streaming
+ => 1
+random_track.file
+ => {"mp3-128"=>"http://popplers5.bandcamp.com/download/track?enc=mp3-128&fsig=6667d236f0f0128472b2d505feb8f43a&id=658956410&stream=1&ts=1417597933.0"}
+random_track.is_draft
+ => false
+random_track.title_link
+ => "/track/judy-blume-2"
+```
+If the information above is not enough, you can access the entire data object from Bandcamp through the release.data attribute.
+Structure of release.data:
+```JSON
+{
+  "artFullsizeUrl": "https://f1.bcbits.com/img/a3489132960_10.jpg",
+  "artThumbURL": "https://f1.bcbits.com/img/a3489132960_3.jpg",
+  "current": {
+      "is_set_price": null,
+      "purchase_title": null,
+      "minimum_price_nonzero": 10,
+      "killed": null,
+      "publish_date": "07 Nov 2013 15:27:37 GMT",
+      "mod_date": "22 Nov 2013 20:01:15 GMT",
+      "art_id": 3489132960,
+      "minimum_price": 10,
+      "featured_track_id": 658956410,
+      "auto_repriced": null,
+      "require_email": null,
+      "download_pref": 2,
+      "title": "An Evening With Neil Gaiman and Amanda Palmer",
+      "new_desc_format": 1,
+      "about": null,
+      "require_email_0": null,
+      "private": null,
+      "artist": "Neil Gaiman and Amanda Palmer",
+      "id": 3510389344,
+      "band_id": 3463798201,
+      "credits": null,
+      "upc": null,
+      "set_price": 7,
+      "new_date": "07 Nov 2013 14:50:34 GMT",
+      "type": "album",
+      "purchase_url": null,
+      "release_date": "19 Nov 2013 00:00:00 GMT",
+      "download_desc_id": null
+  },
+  "hasAudio": true,
+  "trackinfo": [
+    "(all tracks go here... see 'trackinfo')"
+  ],
+  "url": "http://amandapalmer.bandcamp.com/album/an-evening-with-neil-gaiman-and-amanda-palmer"
+}
+```
+Assuming you want the "minimum_price" of a release:
+```ruby
+release.data['current']['minimum_price']
+ => 10.0
+```
+Each entry of "trackinfo" in release.data looks like this:
+```JSON
+{
+    "video_poster_url": null,
+    "is_draft": false,
+    "title_link": "/track/my-last-landlady-3",
+    "download_tooltip": "",
+    "video_caption": null,
+    "has_lyrics": false,
+    "sizeof_lyrics": 0,
+    "duration": 391.821,
+    "license_type": 1,
+    "video_featured": null,
+    "has_info": false,
+    "title": "My Last Landlady",
+    "video_source_type": null,
+    "track_num": 1,
+    "private": null,
+    "alt_link": null,
+    "video_id": null,
+    "is_downloadable": false,
+    "video_source_id": null,
+    "lyrics": null,
+    "album_preorder": false,
+    "id": 1844797083,
+    "encoding_error": null,
+    "has_free_download": null,
+    "video_mobile_url": null,
+    "streaming": 1,
+    "unreleased_track": false,
+    "file": {
+        "mp3-128": "http://popplers5.bandcamp.com/download/track?enc=mp3-128&fsig=25ddaa2b8fa8a008562e4e0c6efc2eff&id=1844797083&stream=1&ts=1417597933.0"
+    },
+    "encoding_pending": null,
+    "free_album_download": false,
+    "encodings_id": 3584714018
+}
+```
+Assuming you want to know if the first track of a release "has_lyrics":
+```ruby
+release.data['trackinfo'][0]['has_lyrics']
+ => false
+```
+## Contributing
+1. Fork it ( https://github.com/[my-github-username]/bc_crawler/fork )
+2. Create your feature branch (`git checkout -b my-new-feature`)
+3. Commit your changes (`git commit -am 'Add some feature'`)
+4. Push to the branch (`git push origin my-new-feature`)
+5. Create a new Pull Request
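
The README's nested lookups such as `release.data['current']['minimum_price']` assume the release has already been crawled and that every intermediate key exists. The sketch below shows one defensive alternative using `Hash#dig` and the safe navigation operator (Ruby 2.3 or later). `SAMPLE_DATA`, `minimum_price` and `first_track_has_lyrics?` are hypothetical names introduced for illustration; they are not part of the gem.

```ruby
# Trimmed, hypothetical stand-in for release.data after a crawl.
SAMPLE_DATA = {
  'current'   => { 'minimum_price' => 10, 'about' => nil },
  'trackinfo' => [{ 'title' => 'My Last Landlady', 'has_lyrics' => false }]
}.freeze

def minimum_price(data)
  # &. guards against data being nil (release not crawled yet);
  # dig returns nil instead of raising when an intermediate key is missing.
  data&.dig('current', 'minimum_price')
end

def first_track_has_lyrics?(data)
  # dig also walks into arrays, so 0 selects the first trackinfo entry.
  data&.dig('trackinfo', 0, 'has_lyrics')
end

p minimum_price(SAMPLE_DATA)
p minimum_price(nil)
p first_track_has_lyrics?(SAMPLE_DATA)
```

Returning `nil` for missing data keeps calling code free of `NoMethodError` rescues, at the cost of not distinguishing "not crawled" from "key absent".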

data/Rakefile ADDED

@@ -0,0 +1,3 @@
+require "bundler/gem_tasks"
+Dir.glob('tasks/**/*.rake').each(&method(:import))

data/bc_crawler.gemspec ADDED

@@ -0,0 +1,26 @@
+# coding: utf-8
+lib = File.expand_path('../lib', __FILE__)
+$LOAD_PATH.unshift(lib) unless $LOAD_PATH.include?(lib)
+require 'bc_crawler/version'
+Gem::Specification.new do |spec|
+  spec.name          = 'bc_crawler'
+  spec.version       = BcCrawler::VERSION
+  spec.authors       = ['Mario Schuettel']
+  spec.email         = ["github@lxxxvi.ch"]
+  spec.summary       = 'Crawl Bandcamp Sites'
+  spec.description   = 'Allows to crawl bandcamp sites, including release and track information'
+  spec.homepage      = ''
+  spec.license       = 'MIT'
+  spec.files         = `git ls-files -z`.split("\x0")
+  spec.executables   = spec.files.grep(%r{^bin/}) { |f| File.basename(f) }
+  spec.test_files    = spec.files.grep(%r{^(test|spec|features)/})
+  spec.require_paths = ['lib']
+  spec.add_development_dependency "bundler", "~> 1.6"
+  spec.add_development_dependency "rake"
+  spec.add_development_dependency 'rspec'
+  spec.add_runtime_dependency 'json', '>= 1.8.1'
+end

data/lib/bc_crawler.rb ADDED

@@ -0,0 +1,11 @@
+require 'open-uri'
+require 'set'
+require 'json'
+require 'bc_crawler/version'
+require 'bc_crawler/helper'
+require 'bc_crawler/main'
+require 'bc_crawler/release'
+require 'bc_crawler/track'
+module BcCrawler
+end

data/lib/bc_crawler/helper.rb ADDED

@@ -0,0 +1,7 @@
+module BcCrawler
+  class Helper
+    def self.get_base_url(url)
+      url[/https?:\/\/(.*?)\//]
+    end
+  end
+end
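
To make the behaviour of `Helper.get_base_url` easier to see, here is a minimal standalone sketch of the same regular expression. `BASE_URL_PATTERN` and `base_url_of` are hypothetical names used only for this illustration. Note that the pattern requires a slash after the host, so a URL without one yields `nil`.

```ruby
# Standalone copy of the pattern used in Helper.get_base_url,
# so it can be tried without installing the gem.
BASE_URL_PATTERN = %r{https?://(.*?)/}

def base_url_of(url)
  # String#[] with a Regexp returns the whole matched text,
  # or nil when the pattern does not match.
  url[BASE_URL_PATTERN]
end

p base_url_of('https://abc.bandcamp.com/album/of-the-year')
p base_url_of('https://abc.bandcamp.com')
```

Callers that build track URLs from this value may want to handle the `nil` case explicitly.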

data/lib/bc_crawler/main.rb ADDED

@@ -0,0 +1,45 @@
+# BcCrawler (Bandcamp Crawler) can be used to fetch release data
+# from a given artist, band or label on bandcamp.com.
+# It will fetch the main information such as band name, release name,
+# track name, track duration, track number, etc.
+module BcCrawler
+  class Main
+    attr_accessor :releases, :url
+    def initialize(url)
+      @url        = url
+      @releases   = []
+      # call the page
+      html = open(@url).read
+      release_paths = Set.new
+      # get all "a" elements that target an /album/... URL
+      html.scan(/<a href="\/album\/(.*?)"/).each { |r| release_paths << "/album/#{r.first}" }
+      # TODO: implement single tracks, that are not assigned to an album, but directly to the artist
+      # initialize the release(s)
+      release_paths.each do |path|
+        @releases << BcCrawler::Release.new("#{ @url }#{ path }")
+      end
+    end
+    def crawl
+      # fetch information about the release
+      @releases.each do |release|
+        release.crawl
+      end
+    end
+    def to_s
+    <<-EOF
+    URL : #{ @url }
+    Number of releases : #{ @releases.count }
+    EOF
+    end
+  end
+end
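
The link extraction in `Main#initialize` can be tried offline with a small sketch like the one below. `SAMPLE_HTML`, `album_paths` and `join_url` are hypothetical names, and the markup is invented for illustration. `join_url` is an assumption about how the doubled slash seen in the README output (`.com//album/...`) could be avoided; the gem itself concatenates the two strings directly.

```ruby
require 'set'

# Hypothetical sample markup; a real Bandcamp page is far larger.
SAMPLE_HTML = <<~HTML
  <a href="/album/first-record">First</a>
  <a href="/album/second-record">Second</a>
  <a href="/album/first-record">First again</a>
  <a href="/track/lonely-single">Single</a>
HTML

# Collect unique /album/... paths, mirroring the scan in Main#initialize.
def album_paths(html)
  paths = Set.new
  html.scan(%r{<a href="/album/(.*?)"}).each do |match|
    paths << "/album/#{match.first}"   # match is an array holding the one capture
  end
  paths.to_a
end

# Join base URL and path without doubling the slash.
def join_url(base, path)
  "#{base.chomp('/')}#{path}"
end

p album_paths(SAMPLE_HTML)
p join_url('https://amandapalmer.bandcamp.com/', '/album/nighty-night')
```

The `Set` drops the repeated album link, and the `/track/...` link is ignored, which matches the TODO about standalone tracks in the original code.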

data/lib/bc_crawler/release.rb ADDED

@@ -0,0 +1,87 @@
+module BcCrawler
+  class Release
+    attr_reader :art_fullsize_url, :art_thumb_url, :art_id, :about, :featured_track_id,
+                :credits, :artist, :purchase_url, :band_id, :id, :release_date,
+                :type, :title, :tracks, :has_audio, :url, :html, :data
+    def initialize(url)
+      @url = url
+      @tracks = []
+    end
+    # Scan the HTML for a particular JavaScript snippet where a variable named "TralbumData" is assigned.
+    # TralbumData contains all information about the release (and its tracks), but has to be cleaned first
+    # in order to get a valid JSON object.
+    #
+    # By default, only the main nodes in TralbumData are crawled. There are more nodes available.
+    #
+    #   nodes = %w(album_is_preorder album_release_date artFullsizeUrl artist artThumbURL
+    #              current defaultPrice featured_track_id FREE freeDownloadPage hasAudio
+    #              id initial_track_num is_preorder item_type last_subscription_item
+    #              maxPrice minPrice packages PAID playing_from preorder_count trackinfo url)
+    def crawl(nodes = %w(artFullsizeUrl artThumbURL current hasAudio trackinfo url))
+      puts "Crawling #{@url}"
+      @nodes = nodes
+      # call the URL, fetch the JavaScript code (TralbumData) and clean the string
+      @html = open(@url).read
+      js_content = html.gsub(/\n/, '~~')[/var TralbumData = \{(.*?)\};/, 1] # get content of JS variable TralbumData
+                       .gsub('~~', "\n")                                  # undo line endings replacement
+                       .gsub("\t", '')                                    # remove tabs
+                       .gsub("\" + \"", '')                               # special bug in "url" node
+      # scan the JavaScript code text for the given nodes
+      json_nodes = []
+      @nodes.each do |node|
+        json_nodes << js_content[/^( )*#{node}( )*:.*$/]                  # fetch current node in JavaScript object
+                               .gsub(/#{node}/, "\"#{node}\"")            # add double quotes around node name
+                               .gsub(/( )*,( )*$/, '')                    # strip the trailing comma at the end of the line
+      end
+      @data = JSON.parse("{ #{ json_nodes.join(', ') } }")
+      # Finally, we load the release info
+      load_release_info
+    end
+    # Assign some of the main information to instance variables
+    # TODO: make ALL information available as instance variables
+    def load_release_info
+      @art_fullsize_url   = @data['artFullsizeUrl']
+      @art_thumb_url      = @data['artThumbURL']
+      @art_id             = @data['current']['art_id']
+      @about              = @data['current']['about']
+      @featured_track_id  = @data['current']['featured_track_id']
+      @credits            = @data['current']['credits']
+      @artist             = @data['current']['artist']
+      @purchase_url       = @data['current']['purchase_url']
+      @band_id            = @data['current']['band_id']
+      @id                 = @data['current']['id']
+      @release_date       = @data['current']['release_date']
+      @type               = @data['current']['type']
+      @title              = @data['current']['title']
+      @has_audio          = @data['hasAudio']
+      load_track_info
+    end
+    # Tracks have their own class
+    def load_track_info
+      @data['trackinfo'].each do |track|
+        @tracks << Track.new(self, track)
+      end
+    end
+    def to_s
+      <<-EOF
+      URL : #{ @url }
+      Artist : #{ @artist }
+      Release title : #{ @title }
+      Number of tracks : #{ @tracks.count }
+      #{ '(use .crawl method to fetch the missing data)' if @artist.nil? }
+      EOF
+    end
+  end
+end
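
The cleaning steps in `Release#crawl` can be exercised without network access. The sketch below is a simplified variant, not the gem's code: it uses the `/m` regex flag instead of the `~~` newline placeholder, quotes only the first occurrence of each key, and skips nodes that are missing. `SAMPLE_PAGE` and `extract_tralbum_nodes` are hypothetical, and the sample markup is invented to resemble the `TralbumData` assignment described in the comments above.

```ruby
require 'json'

# Invented page fragment shaped like the TralbumData assignment.
SAMPLE_PAGE = <<~HTML
  <script>
  var TralbumData = {
      current: {"title":"Demo Album","artist":"Demo Artist"},
      hasAudio: true,
      trackinfo: [{"title":"Intro","track_num":1}],
      url: "http://demo.example" + "/album/demo-album"
  };
  </script>
HTML

def extract_tralbum_nodes(html, nodes)
  # /m lets "." span newlines, so no newline placeholder is needed.
  js_content = html[/var TralbumData = \{(.*?)\};/m, 1]
  return nil if js_content.nil?            # variable not present on the page

  # Remove tabs and merge the concatenated string in the "url" node.
  js_content = js_content.gsub("\t", '').gsub('" + "', '')

  pairs = nodes.map do |node|
    line = js_content[/^ *#{Regexp.escape(node)} *:.*$/]
    next if line.nil?                      # skip nodes missing from this page
    line.sub(node, "\"#{node}\"")          # quote the key (first occurrence only)
        .sub(/ *, *$/, '')                 # drop the trailing comma
  end.compact

  JSON.parse("{ #{pairs.join(', ')} }")
end

data = extract_tralbum_nodes(SAMPLE_PAGE, %w(current hasAudio trackinfo url))
p data['current']['title']
p data['url']
```

Like the original, this approach assumes each node sits on a single line of the script; a value spread over several lines would not parse.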

data/lib/bc_crawler/track.rb ADDED

@@ -0,0 +1,30 @@
+module BcCrawler
+  class Track
+    attr_reader :duration, :track_num, :is_downloadable, :streaming,
+                :is_draft, :id, :title_link, :file, :title, :url
+    def initialize(release, track)
+      @release            = release
+      @duration           = track['duration']
+      @track_num          = track['track_num']
+      @is_downloadable    = track['is_downloadable']
+      @streaming          = track['streaming']
+      @is_draft           = track['is_draft']
+      @id                 = track['id']
+      @title_link         = track['title_link']
+      @file               = track['file']
+      @title              = track['title']
+      @url                = "#{ BcCrawler::Helper.get_base_url(@release.url) }#{ track['title_link'] }"
+    end
+    def to_s
+      <<-EOF
+      URL : #{ @url }
+      Track number : #{ @track_num }
+      Track name : #{ @title }
+      Duration : #{ @duration }
+      EOF
+    end
+  end
+end

data/lib/bc_crawler/version.rb ADDED

@@ -0,0 +1,3 @@
+module BcCrawler
+  VERSION = '0.0.4'
+end

data/spec/bc_crawler_spec.rb ADDED

@@ -0,0 +1,30 @@
+require 'spec_helper'
+describe BcCrawler do
+  before(:all) do
+    @test_release_url = 'http://amandapalmer.bandcamp.com/album/amanda-palmer-performs-the-popular-hits-of-radiohead-on-her-magical-ukulele'
+  end
+  it 'returns the base url' do
+    base_url = BcCrawler::Helper.get_base_url('https://abc.bandcamp.com/album/of-the-year')
+    expect(base_url).to eq('https://abc.bandcamp.com/')
+  end
+  it 'crawls the main page' do
+    main_page = BcCrawler::Main.new('http://amandapalmer.bandcamp.com/')
+    expect(main_page.releases.count).to be > 0
+  end
+  it 'crawls the release page' do
+    album_page = BcCrawler::Release.new(@test_release_url)
+    album_page.crawl
+    expect(album_page.title).to eq('Amanda Palmer Performs The Popular Hits Of Radiohead On Her Magical Ukulele')
+  end
+  it 'stores the trackinfo' do
+    album_page = BcCrawler::Release.new(@test_release_url)
+    album_page.crawl
+    expect(album_page.tracks.first.track_num).to be == 1
+  end
+end

data/spec/spec_helper.rb ADDED

@@ -0,0 +1 @@
+require 'bc_crawler'

data/tasks/rspec.rake ADDED

@@ -0,0 +1,3 @@
+require 'rspec/core/rake_task'
+RSpec::Core::RakeTask.new(:spec)

metadata ADDED

@@ -0,0 +1,133 @@
+--- !ruby/object:Gem::Specification
+name: bc_crawler
+version: !ruby/object:Gem::Version
+  version: 0.0.4
+  prerelease:
+platform: ruby
+authors:
+- Mario Schuettel
+autorequire:
+bindir: bin
+cert_chain: []
+date: 2015-01-03 00:00:00.000000000 Z
+dependencies:
+- !ruby/object:Gem::Dependency
+  name: bundler
+  requirement: !ruby/object:Gem::Requirement
+    none: false
+    requirements:
+    - - ~>
+      - !ruby/object:Gem::Version
+        version: '1.6'
+  type: :development
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    none: false
+    requirements:
+    - - ~>
+      - !ruby/object:Gem::Version
+        version: '1.6'
+- !ruby/object:Gem::Dependency
+  name: rake
+  requirement: !ruby/object:Gem::Requirement
+    none: false
+    requirements:
+    - - ! '>='
+      - !ruby/object:Gem::Version
+        version: '0'
+  type: :development
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    none: false
+    requirements:
+    - - ! '>='
+      - !ruby/object:Gem::Version
+        version: '0'
+- !ruby/object:Gem::Dependency
+  name: rspec
+  requirement: !ruby/object:Gem::Requirement
+    none: false
+    requirements:
+    - - ! '>='
+      - !ruby/object:Gem::Version
+        version: '0'
+  type: :development
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    none: false
+    requirements:
+    - - ! '>='
+      - !ruby/object:Gem::Version
+        version: '0'
+- !ruby/object:Gem::Dependency
+  name: json
+  requirement: !ruby/object:Gem::Requirement
+    none: false
+    requirements:
+    - - ! '>='
+      - !ruby/object:Gem::Version
+        version: 1.8.1
+  type: :runtime
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    none: false
+    requirements:
+    - - ! '>='
+      - !ruby/object:Gem::Version
+        version: 1.8.1
+description: Allows to crawl bandcamp sites, including release and track information
+email:
+- github@lxxxvi.ch
+executables: []
+extensions: []
+extra_rdoc_files: []
+files:
+- .gitignore
+- Gemfile
+- LICENSE.txt
+- README.md
+- Rakefile
+- bc_crawler.gemspec
+- lib/bc_crawler.rb
+- lib/bc_crawler/helper.rb
+- lib/bc_crawler/main.rb
+- lib/bc_crawler/release.rb
+- lib/bc_crawler/track.rb
+- lib/bc_crawler/version.rb
+- spec/bc_crawler_spec.rb
+- spec/spec_helper.rb
+- tasks/rspec.rake
+homepage: ''
+licenses:
+- MIT
+post_install_message:
+rdoc_options: []
+require_paths:
+- lib
+required_ruby_version: !ruby/object:Gem::Requirement
+  none: false
+  requirements:
+  - - ! '>='
+    - !ruby/object:Gem::Version
+      version: '0'
+      segments:
+      - 0
+      hash: 4069188658231555620
+required_rubygems_version: !ruby/object:Gem::Requirement
+  none: false
+  requirements:
+  - - ! '>='
+    - !ruby/object:Gem::Version
+      version: '0'
+      segments:
+      - 0
+      hash: 4069188658231555620
+requirements: []
+rubyforge_project:
+rubygems_version: 1.8.24
+signing_key:
+specification_version: 3
+summary: Crawl Bandcamp Sites
+test_files:
+- spec/bc_crawler_spec.rb
+- spec/spec_helper.rb