RubyGems - bc_crawler - Versions diffs - 0.0.4 - Mend

bc_crawler 0.0.4

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (16)

data/.gitignore ADDED

@@ -0,0 +1,22 @@
+*.gem
+*.rbc
+.bundle
+.config
+.yardoc
+Gemfile.lock
+InstalledFiles
+_yardoc
+coverage
+doc/
+lib/bundler/man
+pkg
+rdoc
+spec/reports
+test/tmp
+test/version_tmp
+tmp
+*.bundle
+*.so
+*.o
+*.a
+mkmf.log

data/Gemfile ADDED

@@ -0,0 +1,4 @@
+source 'https://rubygems.org'
+# Specify your gem's dependencies in bc_crawler.gemspec
+gemspec

data/LICENSE.txt ADDED

@@ -0,0 +1,22 @@
+Copyright (c) 2014 Mario Schuettel
+MIT License
+Permission is hereby granted, free of charge, to any person obtaining
+a copy of this software and associated documentation files (the
+"Software"), to deal in the Software without restriction, including
+without limitation the rights to use, copy, modify, merge, publish,
+distribute, sublicense, and/or sell copies of the Software, and to
+permit persons to whom the Software is furnished to do so, subject to
+the following conditions:
+The above copyright notice and this permission notice shall be
+included in all copies or substantial portions of the Software.
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
+LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
+OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
+WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

data/README.md ADDED

@@ -0,0 +1,252 @@
+# BcCrawler
+A simple Ruby Gem to crawl bandcamp.com sites. It will load information about the artist/label/band, their releases (albums) and all tracks.
+## Installation
+Add this line to your application's Gemfile:
+    gem 'bc_crawler'
+And then execute:
+    $ bundle
+Or install it yourself as:
+    $ gem install bc_crawler
+## Usage
+### Crawl an artist/label/band
+```ruby
+require 'bc_crawler'
+main = BcCrawler::Main.new('https://amandapalmer.bandcamp.com/')
+ => URL: https://amandapalmer.bandcamp.com/
+main.releases.first
+ =>  URL : https://amandapalmer.bandcamp.com//album/an-evening-with-neil-gaiman-and-amanda-palmer
+    Data :
+```
+Initially, the data attribute is empty because only the main page has been crawled.
+### Crawl a release
+```ruby
+main.releases.first.crawl
+main.releases.first
+ =>  URL : https://amandapalmer.bandcamp.com//album/an-evening-with-neil-gaiman-and-amanda-palmer
+    Data : { Hash }
+```
+### Crawl all releases from an artist/label/band at once
+```ruby
+main.crawl
+# Crawling https://amandapalmer.bandcamp.com//album/an-evening-with-neil-gaiman-and-amanda-palmer
+# Crawling https://amandapalmer.bandcamp.com//album/theatre-is-evil-2
+# Crawling https://amandapalmer.bandcamp.com//album/amanda-palmer-goes-down-under
+# Crawling https://amandapalmer.bandcamp.com//album/amanda-palmer-performs-the-popular-hits-of-radiohead-on-her-magical-ukulele
+# Crawling https://amandapalmer.bandcamp.com//album/nighty-night
+# Crawling https://amandapalmer.bandcamp.com//album/who-killed-amanda-palmer
+# Crawling https://amandapalmer.bandcamp.com//album/who-killed-amanda-palmer-alternate-tracks
+# Crawling https://amandapalmer.bandcamp.com//album/map-of-tasmania-the-remix-project
+# Crawling https://amandapalmer.bandcamp.com//album/7-series-part-3
+```
+Certain information about releases and tracks can be accessed directly through attributes.
+### Release information
+```ruby
+release = main.releases.first
+release.artist
+ => "Neil Gaiman and Amanda Palmer"
+release.band_id
+ => 3463798201
+release.type
+ => "album"
+release.title
+ => "An Evening With Neil Gaiman and Amanda Palmer"
+release.id              # "Release ID"
+ => 3510389344
+release.release_date
+ => "19 Nov 2013 00:00:00 GMT"
+release.featured_track_id
+ => 658956410
+release.about
+ => nil
+release.credits
+ => nil
+release.art_fullsize_url
+ => "https://f1.bcbits.com/img/a3489132960_10.jpg"
+release.art_thumb_url
+ => "https://f1.bcbits.com/img/a3489132960_3.jpg"
+release.art_id
+ => nil
+release.has_audio
+ => true
+release.purchase_url
+ => nil
+```
+A release holds one or more tracks in an array. Each track has the following attributes:
+### Track information
+```ruby
+random_track = release.tracks.sample
+random_track.id         # "Track ID"
+ => 658956410
+random_track.track_num
+ => 32
+random_track.title
+ => "Judy Blume"
+random_track.duration
+ => 395.093
+random_track.url
+ => "https://amandapalmer.bandcamp.com//track/judy-blume-2"
+random_track.is_downloadable
+ => true
+random_track.streaming
+ => 1
+random_track.file
+ => {"mp3-128"=>"http://popplers5.bandcamp.com/download/track?enc=mp3-128&fsig=6667d236f0f0128472b2d505feb8f43a&id=658956410&stream=1&ts=1417597933.0"}
+random_track.is_draft
+ => false
+random_track.title_link
+ => "/track/judy-blume-2"
+```
+If the information above is not enough, you can access the entire data object from Bandcamp through the `release.data` attribute.
+Structure of `release.data`:
+```JSON
+{
+  "artFullsizeUrl": "https://f1.bcbits.com/img/a3489132960_10.jpg",
+  "artThumbURL": "https://f1.bcbits.com/img/a3489132960_3.jpg",
+  "current": {
+      "is_set_price": null,
+      "purchase_title": null,
+      "minimum_price_nonzero": 10,
+      "killed": null,
+      "publish_date": "07 Nov 2013 15:27:37 GMT",
+      "mod_date": "22 Nov 2013 20:01:15 GMT",
+      "art_id": 3489132960,
+      "minimum_price": 10,
+      "featured_track_id": 658956410,
+      "auto_repriced": null,
+      "require_email": null,
+      "download_pref": 2,
+      "title": "An Evening With Neil Gaiman and Amanda Palmer",
+      "new_desc_format": 1,
+      "about": null,
+      "require_email_0": null,
+      "private": null,
+      "artist": "Neil Gaiman and Amanda Palmer",
+      "id": 3510389344,
+      "band_id": 3463798201,
+      "credits": null,
+      "upc": null,
+      "set_price": 7,
+      "new_date": "07 Nov 2013 14:50:34 GMT",
+      "type": "album",
+      "purchase_url": null,
+      "release_date": "19 Nov 2013 00:00:00 GMT",
+      "download_desc_id": null
+  },
+  "hasAudio": true,
+  "trackinfo": [
+    "(all tracks go here... see 'trackinfo')"
+  ],
+  "url": "http://amandapalmer.bandcamp.com/album/an-evening-with-neil-gaiman-and-amanda-palmer"
+}
+```
+Assuming you want the "minimum_price" of a release:
+```ruby
+release.data['current']['minimum_price']
+ => 10.0
+```
+Each entry of "trackinfo" in release.data looks like this:
+```JSON
+{
+    "video_poster_url": null,
+    "is_draft": false,
+    "title_link": "/track/my-last-landlady-3",
+    "download_tooltip": "",
+    "video_caption": null,
+    "has_lyrics": false,
+    "sizeof_lyrics": 0,
+    "duration": 391.821,
+    "license_type": 1,
+    "video_featured": null,
+    "has_info": false,
+    "title": "My Last Landlady",
+    "video_source_type": null,
+    "track_num": 1,
+    "private": null,
+    "alt_link": null,
+    "video_id": null,
+    "is_downloadable": false,
+    "video_source_id": null,
+    "lyrics": null,
+    "album_preorder": false,
+    "id": 1844797083,
+    "encoding_error": null,
+    "has_free_download": null,
+    "video_mobile_url": null,
+    "streaming": 1,
+    "unreleased_track": false,
+    "file": {
+        "mp3-128": "http://popplers5.bandcamp.com/download/track?enc=mp3-128&fsig=25ddaa2b8fa8a008562e4e0c6efc2eff&id=1844797083&stream=1&ts=1417597933.0"
+    },
+    "encoding_pending": null,
+    "free_album_download": false,
+    "encodings_id": 3584714018
+}
+```
+Assuming you want to know if the first track of a release "has_lyrics":
+```ruby
+release.data['trackinfo'][0]['has_lyrics']
+ => false
+```
+## Contributing
+1. Fork it ( https://github.com/[my-github-username]/bc_crawler/fork )
+2. Create your feature branch (`git checkout -b my-new-feature`)
+3. Commit your changes (`git commit -am 'Add some feature'`)
+4. Push to the branch (`git push origin my-new-feature`)
+5. Create a new Pull Request
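
The README reads nested values from `release.data` with chained `[]` calls, which raise `NoMethodError` as soon as an intermediate key is missing. A minimal sketch of a more forgiving lookup, assuming Ruby 2.3 or newer for `Hash#dig`; the sample hash is a trimmed-down, hand-written copy of the structure shown in the README, not live Bandcamp data:

```ruby
require 'json'

# Hand-written sample in the shape of release.data (assumption: real data has many more keys).
data = JSON.parse(<<-'SAMPLE')
{
  "current": { "minimum_price": 10, "title": "An Evening With Neil Gaiman and Amanda Palmer" },
  "trackinfo": [ { "title": "My Last Landlady", "has_lyrics": false, "duration": 391.821 } ]
}
SAMPLE

# Hash#dig walks nested hashes and arrays and returns nil when a key is absent.
min_price    = data.dig('current', 'minimum_price')
first_lyrics = data.dig('trackinfo', 0, 'has_lyrics')
missing      = data.dig('current', 'no_such_key', 'nested')

puts min_price.inspect
puts first_lyrics.inspect
puts missing.inspect
```

On older Rubies the same effect can be approximated with `fetch` and a default value at each level.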

data/Rakefile ADDED

@@ -0,0 +1,3 @@
+require "bundler/gem_tasks"
+Dir.glob('tasks/**/*.rake').each(&method(:import))

data/bc_crawler.gemspec ADDED

@@ -0,0 +1,26 @@
+# coding: utf-8
+lib = File.expand_path('../lib', __FILE__)
+$LOAD_PATH.unshift(lib) unless $LOAD_PATH.include?(lib)
+require 'bc_crawler/version'
+Gem::Specification.new do |spec|
+  spec.name          = 'bc_crawler'
+  spec.version       = BcCrawler::VERSION
+  spec.authors       = ['Mario Schuettel']
+  spec.email         = ["github@lxxxvi.ch"]
+  spec.summary       = 'Crawl Bandcamp Sites'
+  spec.description   = 'Allows to crawl bandcamp sites, including release and track information'
+  spec.homepage      = ''
+  spec.license       = 'MIT'
+  spec.files         = `git ls-files -z`.split("\x0")
+  spec.executables   = spec.files.grep(%r{^bin/}) { |f| File.basename(f) }
+  spec.test_files    = spec.files.grep(%r{^(test|spec|features)/})
+  spec.require_paths = ['lib']
+  spec.add_development_dependency "bundler", "~> 1.6"
+  spec.add_development_dependency "rake"
+  spec.add_development_dependency 'rspec'
+  spec.add_runtime_dependency 'json', '>= 1.8.1'
+end

data/lib/bc_crawler.rb ADDED

@@ -0,0 +1,11 @@
+require 'open-uri'
+require 'set'
+require 'json'
+require 'bc_crawler/version'
+require 'bc_crawler/helper'
+require 'bc_crawler/main'
+require 'bc_crawler/release'
+require 'bc_crawler/track'
+module BcCrawler
+end

data/lib/bc_crawler/helper.rb ADDED

@@ -0,0 +1,7 @@
+module BcCrawler
+  class Helper
+    def self.get_base_url(url)
+      url[/https?:\/\/(.*?)\//]
+    end
+  end
+end
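
The pattern in `get_base_url` only matches when a slash follows the host name, so a URL without any slash after the host yields `nil`. A minimal sketch of that behavior, using a hypothetical standalone method `base_url` that copies the regular expression from the helper:

```ruby
# Hypothetical standalone copy of the regex used in BcCrawler::Helper.get_base_url.
def base_url(url)
  # String#[] with a Regexp returns the whole match, or nil when nothing matches.
  url[/https?:\/\/(.*?)\//]
end

with_path     = base_url('https://abc.bandcamp.com/album/of-the-year')
without_slash = base_url('https://abc.bandcamp.com')

puts with_path.inspect      # scheme, host and the slash that follows the host
puts without_slash.inspect  # nil, because no slash follows the host
```

Callers that build track URLs from this value may want to handle the `nil` case explicitly.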

data/lib/bc_crawler/main.rb ADDED

@@ -0,0 +1,45 @@
+# BcCrawler (Bandcamp Crawler) can be used to fetch release data
+# from a given artist, band or label on bandcamp.com.
+# It will fetch the main information such as band name, release name,
+# track name, track duration, track number, etc.
+module BcCrawler
+  class Main
+    attr_accessor :releases, :url
+    def initialize(url)
+      @url        = url
+      @releases   = []
+      # call the page
+      html = open(@url).read
+      release_paths = Set.new
+      # get all "a" elements that target an /album/... URL
+      html.scan(/<a href="\/album\/(.*?)"/).each { |r| release_paths << "/album/#{r.first}" }
+      # TODO: implement single tracks, that are not assigned to an album, but directly to the artist
+      # initialize the release(s)
+      release_paths.each do |path|
+        @releases << BcCrawler::Release.new("#{ @url }#{ path }")
+      end
+    end
+    def crawl
+      # fetch information about the release
+      @releases.each do |release|
+        release.crawl
+      end
+    end
+    def to_s
+    <<-EOF
+    URL : #{ @url }
+    Number of releases : #{ @releases.count }
+    EOF
+    end
+  end
+end
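
`Main#initialize` collects album links with `String#scan` and a `Set`, then appends each path to the given URL. When that URL ends in a slash, the result contains a double slash. The sketch below shows the same scan on an invented HTML snippet plus one possible way to join base and path cleanly; `album_paths` and `join_url` are hypothetical helpers, not part of the gem:

```ruby
require 'set'

# Hypothetical helper: collect unique /album/... paths from an HTML string,
# using the same scan pattern as BcCrawler::Main#initialize.
def album_paths(html)
  paths = Set.new
  html.scan(/<a href="\/album\/(.*?)"/).each { |m| paths << "/album/#{m.first}" }
  paths.to_a
end

# Hypothetical helper: join base URL and path without doubling the slash.
def join_url(base, path)
  "#{base.chomp('/')}#{path}"
end

# Invented sample markup; real Bandcamp pages are not guaranteed to look like this.
sample_html = <<-HTML
<a href="/album/first-album">First</a>
<a href="/album/second-album">Second</a>
<a href="/album/first-album">First again</a>
<a href="/track/loose-track">Track</a>
HTML

album_paths(sample_html).each do |path|
  puts join_url('https://example.bandcamp.com/', path)
end
```

The `Set` drops the repeated link, and the `/track/...` link is ignored because the pattern only targets album paths.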

data/lib/bc_crawler/release.rb ADDED

@@ -0,0 +1,87 @@
+module BcCrawler
+  class Release
+    attr_reader :art_fullsize_url, :art_thumb_url, :art_id, :about, :featured_track_id,
+                :credits, :artist, :purchase_url, :band_id, :id, :release_date,
+                :type, :title, :tracks, :has_audio, :url, :html, :data
+    def initialize(url)
+      @url = url
+      @tracks = []
+    end
+    # Scan the HTML for a particular JavaScript snippet where a variable named "TralbumData" is assigned.
+    # TralbumData contains all information about the release (and its tracks), but has to be cleaned first
+    # in order to get a valid JSON object.
+    #
+    # By default, only the main nodes in TralbumData are crawled. There are more nodes available.
+    #
+    #   nodes = %w(album_is_preorder album_release_date artFullsizeUrl artist artThumbURL
+    #              current defaultPrice featured_track_id FREE freeDownloadPage hasAudio
+    #              id initial_track_num is_preorder item_type last_subscription_item
+    #              maxPrice minPrice packages PAID playing_from preorder_count trackinfo url)
+    def crawl(nodes = %w(artFullsizeUrl artThumbURL current hasAudio trackinfo url))
+      puts "Crawling #{@url}"
+      @nodes = nodes
+      # call the URL, fetch the JavaScript code (TralbumData) and clean the string
+      @html = open(@url).read
+      js_content = html.gsub(/\n/, '~~')[/var TralbumData = \{(.*?)\};/, 1] # get content of JS variable TralbumData
+                       .gsub('~~', "\n")                                  # undo line endings replacement
+                       .gsub("\t", '')                                    # remove tabs
+                       .gsub("\" + \"", '')                               # special bug in "url" node
+      # scan the JavaScript code text for the given nodes
+      json_nodes = []
+      @nodes.each do |node|
+        json_nodes << js_content[/^( )*#{node}( )*:.*$/]                  # fetch current node in JavaScript object
+                               .gsub(/#{node}/, "\"#{node}\"")            # add double quotes around node name
+                               .gsub(/( )*,( )*$/, '')                    # strip the trailing comma
+      end
+      @data = JSON.parse("{ #{ json_nodes.join(', ') } }")
+      # Finally, we load the release info
+      load_release_info
+    end
+    # Assign some of the main information to instance variables
+    # TODO: make ALL information available as instance variables
+    def load_release_info
+      @art_fullsize_url   = @data['artFullsizeUrl']
+      @art_thumb_url      = @data['artThumbURL']
+      @art_id             = @data['current']['art_id']
+      @about              = @data['current']['about']
+      @featured_track_id  = @data['current']['featured_track_id']
+      @credits            = @data['current']['credits']
+      @artist             = @data['current']['artist']
+      @purchase_url       = @data['current']['purchase_url']
+      @band_id            = @data['current']['band_id']
+      @id                 = @data['current']['id']
+      @release_date       = @data['current']['release_date']
+      @type               = @data['current']['type']
+      @title              = @data['current']['title']
+      @has_audio          = @data['hasAudio']
+      load_track_info
+    end
+    # Tracks have their own class
+    def load_track_info
+      @data['trackinfo'].each do |track|
+        @tracks << Track.new(self, track)
+      end
+    end
+    def to_s
+      <<-EOF
+      URL : #{ @url }
+      Artist : #{ @artist }
+      Release title : #{ @title }
+      Number of tracks : #{ @tracks.count }
+      #{ '(use .crawl method to fetch the missing data)' if @artist.nil? }
+      EOF
+    end
+  end
+end
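
The cleaning steps in `Release#crawl` are easier to follow on a small input. The sketch below replays them on an invented page fragment: flatten the newlines, cut out the body of `TralbumData`, restore the newlines, merge the split `url` string, quote the node names and parse the result as JSON. `extract_nodes` is a hypothetical helper, the sample markup is made up, and the `nil` guards are additions that the gem itself does not have:

```ruby
require 'json'

# Hypothetical helper mirroring the string cleanup in BcCrawler::Release#crawl.
def extract_nodes(html, nodes)
  # Flatten newlines so the non-greedy match can span the whole object literal.
  raw = html.gsub(/\n/, '~~')[/var TralbumData = \{(.*?)\};/, 1]
  raise ArgumentError, 'TralbumData not found' if raw.nil?

  js = raw.gsub('~~', "\n")   # restore line endings
          .gsub("\t", '')     # remove tabs
          .gsub('" + "', '')  # merge string literals split with +

  pairs = nodes.map do |node|
    line = js[/^( )*#{node}( )*:.*$/]      # the line that assigns this node
    raise ArgumentError, "node #{node} not found" if line.nil?
    line.gsub(/#{node}/, "\"#{node}\"")    # quote the node name
        .gsub(/( )*,( )*$/, '')            # strip the trailing comma
  end

  JSON.parse("{ #{pairs.join(', ')} }")
end

# Invented fragment; real Bandcamp pages are not guaranteed to look like this.
sample_html = <<-'HTML'
<script>
var TralbumData = {
    current: {"title":"Sample Album","artist":"Sample Artist"},
    hasAudio: true,
    trackinfo: [{"title":"Intro","track_num":1}],
    url: "http://example.bandcamp.com" + "/album/sample-album"
};
</script>
HTML

data = extract_nodes(sample_html, %w(current hasAudio trackinfo url))
puts data['current']['title']
puts data['url']
```

Because the node name is replaced everywhere on its line, a value that happens to contain the node name (for example a URL containing `url`) would be corrupted. A stricter version could anchor the substitution to the start of the line.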

data/lib/bc_crawler/track.rb ADDED

@@ -0,0 +1,30 @@
+module BcCrawler
+  class Track
+    attr_reader :duration, :track_num, :is_downloadable, :streaming,
+                :is_draft, :id, :title_link, :file, :title, :url
+    def initialize(release, track)
+      @release            = release
+      @duration           = track['duration']
+      @track_num          = track['track_num']
+      @is_downloadable    = track['is_downloadable']
+      @streaming          = track['streaming']
+      @is_draft           = track['is_draft']
+      @id                 = track['id']
+      @title_link         = track['title_link']
+      @file               = track['file']
+      @title              = track['title']
+      @url                = "#{ BcCrawler::Helper.get_base_url(@release.url) }#{ track['title_link'] }"
+    end
+    def to_s
+      <<-EOF
+      URL : #{ @url }
+      Track number : #{ @track_num }
+      Track name : #{ @title }
+      Duration : #{ @duration }
+      EOF
+    end
+  end
+end
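
`Release#tracks` is a plain array of `Track` objects, so the usual `Array` methods apply. A short sketch of picking and ordering tracks, using a hypothetical `SketchTrack` struct as a stand-in for `BcCrawler::Track` so that it runs without network access:

```ruby
# Hypothetical stand-in for BcCrawler::Track with only two fields.
SketchTrack = Struct.new(:track_num, :title)

tracks = [
  SketchTrack.new(1, 'First'),
  SketchTrack.new(2, 'Second'),
  SketchTrack.new(3, 'Third')
]

# Indexing with tracks.count is one past the last element and returns nil,
# which is why an inclusive rand(0..tracks.count) can yield no track at all.
out_of_bounds = tracks[tracks.count]

# Array#sample always returns an element of a non-empty array.
random_track = tracks.sample
puts "#{random_track.track_num}: #{random_track.title}"

# Sorting by track number restores playback order regardless of input order.
ordered = tracks.shuffle.sort_by(&:track_num)
puts ordered.map(&:title).join(', ')
```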

data/lib/bc_crawler/version.rb ADDED

@@ -0,0 +1,3 @@
+module BcCrawler
+  VERSION = '0.0.4'
+end

data/spec/bc_crawler_spec.rb ADDED

@@ -0,0 +1,30 @@
+require 'spec_helper'
+describe BcCrawler do
+  before(:all) do
+    @test_release_url = 'http://amandapalmer.bandcamp.com/album/amanda-palmer-performs-the-popular-hits-of-radiohead-on-her-magical-ukulele'
+  end
+  it 'returns the base url' do
+    base_url = BcCrawler::Helper.get_base_url('https://abc.bandcamp.com/album/of-the-year')
+    expect(base_url).to eq('https://abc.bandcamp.com/')
+  end
+  it 'crawls the main page' do
+    main_page = BcCrawler::Main.new('http://amandapalmer.bandcamp.com/')
+    expect(main_page.releases.count).to be > 0
+  end
+  it 'crawls the release page' do
+    album_page = BcCrawler::Release.new(@test_release_url)
+    album_page.crawl
+    expect(album_page.title).to eq('Amanda Palmer Performs The Popular Hits Of Radiohead On Her Magical Ukulele')
+  end
+  it 'stores the trackinfo' do
+    album_page = BcCrawler::Release.new(@test_release_url)
+    album_page.crawl
+    expect(album_page.tracks.first.track_num).to be == 1
+  end
+end

data/spec/spec_helper.rb ADDED

@@ -0,0 +1 @@
+require 'bc_crawler'

data/tasks/rspec.rake ADDED

@@ -0,0 +1,3 @@
+require 'rspec/core/rake_task'
+RSpec::Core::RakeTask.new(:spec)

metadata ADDED

@@ -0,0 +1,133 @@
+--- !ruby/object:Gem::Specification
+name: bc_crawler
+version: !ruby/object:Gem::Version
+  version: 0.0.4
+  prerelease:
+platform: ruby
+authors:
+- Mario Schuettel
+autorequire:
+bindir: bin
+cert_chain: []
+date: 2015-01-03 00:00:00.000000000 Z
+dependencies:
+- !ruby/object:Gem::Dependency
+  name: bundler
+  requirement: !ruby/object:Gem::Requirement
+    none: false
+    requirements:
+    - - ~>
+      - !ruby/object:Gem::Version
+        version: '1.6'
+  type: :development
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    none: false
+    requirements:
+    - - ~>
+      - !ruby/object:Gem::Version
+        version: '1.6'
+- !ruby/object:Gem::Dependency
+  name: rake
+  requirement: !ruby/object:Gem::Requirement
+    none: false
+    requirements:
+    - - ! '>='
+      - !ruby/object:Gem::Version
+        version: '0'
+  type: :development
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    none: false
+    requirements:
+    - - ! '>='
+      - !ruby/object:Gem::Version
+        version: '0'
+- !ruby/object:Gem::Dependency
+  name: rspec
+  requirement: !ruby/object:Gem::Requirement
+    none: false
+    requirements:
+    - - ! '>='
+      - !ruby/object:Gem::Version
+        version: '0'
+  type: :development
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    none: false
+    requirements:
+    - - ! '>='
+      - !ruby/object:Gem::Version
+        version: '0'
+- !ruby/object:Gem::Dependency
+  name: json
+  requirement: !ruby/object:Gem::Requirement
+    none: false
+    requirements:
+    - - ! '>='
+      - !ruby/object:Gem::Version
+        version: 1.8.1
+  type: :runtime
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    none: false
+    requirements:
+    - - ! '>='
+      - !ruby/object:Gem::Version
+        version: 1.8.1
+description: Allows to crawl bandcamp sites, including release and track information
+email:
+- github@lxxxvi.ch
+executables: []
+extensions: []
+extra_rdoc_files: []
+files:
+- .gitignore
+- Gemfile
+- LICENSE.txt
+- README.md
+- Rakefile
+- bc_crawler.gemspec
+- lib/bc_crawler.rb
+- lib/bc_crawler/helper.rb
+- lib/bc_crawler/main.rb
+- lib/bc_crawler/release.rb
+- lib/bc_crawler/track.rb
+- lib/bc_crawler/version.rb
+- spec/bc_crawler_spec.rb
+- spec/spec_helper.rb
+- tasks/rspec.rake
+homepage: ''
+licenses:
+- MIT
+post_install_message:
+rdoc_options: []
+require_paths:
+- lib
+required_ruby_version: !ruby/object:Gem::Requirement
+  none: false
+  requirements:
+  - - ! '>='
+    - !ruby/object:Gem::Version
+      version: '0'
+      segments:
+      - 0
+      hash: 4069188658231555620
+required_rubygems_version: !ruby/object:Gem::Requirement
+  none: false
+  requirements:
+  - - ! '>='
+    - !ruby/object:Gem::Version
+      version: '0'
+      segments:
+      - 0
+      hash: 4069188658231555620
+requirements: []
+rubyforge_project:
+rubygems_version: 1.8.24
+signing_key:
+specification_version: 3
+summary: Crawl Bandcamp Sites
+test_files:
+- spec/bc_crawler_spec.rb
+- spec/spec_helper.rb