puppet-community-mvp 0.0.1
Sign up to get free protection for your applications and to get access to all the features.
- checksums.yaml +7 -0
- data/LICENSE +202 -0
- data/README.md +0 -0
- data/bin/mvp +121 -0
- data/lib/mvp.rb +4 -0
- data/lib/mvp/downloader.rb +199 -0
- data/lib/mvp/monkeypatches.rb +8 -0
- data/lib/mvp/runner.rb +54 -0
- data/lib/mvp/stats.rb +339 -0
- data/lib/mvp/uploader.rb +100 -0
- metadata +170 -0
checksums.yaml
ADDED
@@ -0,0 +1,7 @@
|
|
1
|
+
---
|
2
|
+
SHA1:
|
3
|
+
metadata.gz: 8ed5308091443f5847159a6a481611fba281f4d5
|
4
|
+
data.tar.gz: f7bb0dd50ea248c04b5809144d67355ad7c6c202
|
5
|
+
SHA512:
|
6
|
+
metadata.gz: ececdc2a2121c4054fc49b16385892e78364b9ed197b2ac3e38a5542de2f5be94cf52ac9e6d6e1590c7e91b912fa54f4e9a70e71e60ae831fe545b98731021ee
|
7
|
+
data.tar.gz: 5f87defac101d2105403c0b5b54d34c37cac1b552fc77a2008d7d62624df41c8d1d4813be9c2cf15ded62942d6e0cbc70f36862d39866caa2bd77d6d9528aba0
|
data/LICENSE
ADDED
@@ -0,0 +1,202 @@
|
|
1
|
+
|
2
|
+
Apache License
|
3
|
+
Version 2.0, January 2004
|
4
|
+
http://www.apache.org/licenses/
|
5
|
+
|
6
|
+
TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
|
7
|
+
|
8
|
+
1. Definitions.
|
9
|
+
|
10
|
+
"License" shall mean the terms and conditions for use, reproduction,
|
11
|
+
and distribution as defined by Sections 1 through 9 of this document.
|
12
|
+
|
13
|
+
"Licensor" shall mean the copyright owner or entity authorized by
|
14
|
+
the copyright owner that is granting the License.
|
15
|
+
|
16
|
+
"Legal Entity" shall mean the union of the acting entity and all
|
17
|
+
other entities that control, are controlled by, or are under common
|
18
|
+
control with that entity. For the purposes of this definition,
|
19
|
+
"control" means (i) the power, direct or indirect, to cause the
|
20
|
+
direction or management of such entity, whether by contract or
|
21
|
+
otherwise, or (ii) ownership of fifty percent (50%) or more of the
|
22
|
+
outstanding shares, or (iii) beneficial ownership of such entity.
|
23
|
+
|
24
|
+
"You" (or "Your") shall mean an individual or Legal Entity
|
25
|
+
exercising permissions granted by this License.
|
26
|
+
|
27
|
+
"Source" form shall mean the preferred form for making modifications,
|
28
|
+
including but not limited to software source code, documentation
|
29
|
+
source, and configuration files.
|
30
|
+
|
31
|
+
"Object" form shall mean any form resulting from mechanical
|
32
|
+
transformation or translation of a Source form, including but
|
33
|
+
not limited to compiled object code, generated documentation,
|
34
|
+
and conversions to other media types.
|
35
|
+
|
36
|
+
"Work" shall mean the work of authorship, whether in Source or
|
37
|
+
Object form, made available under the License, as indicated by a
|
38
|
+
copyright notice that is included in or attached to the work
|
39
|
+
(an example is provided in the Appendix below).
|
40
|
+
|
41
|
+
"Derivative Works" shall mean any work, whether in Source or Object
|
42
|
+
form, that is based on (or derived from) the Work and for which the
|
43
|
+
editorial revisions, annotations, elaborations, or other modifications
|
44
|
+
represent, as a whole, an original work of authorship. For the purposes
|
45
|
+
of this License, Derivative Works shall not include works that remain
|
46
|
+
separable from, or merely link (or bind by name) to the interfaces of,
|
47
|
+
the Work and Derivative Works thereof.
|
48
|
+
|
49
|
+
"Contribution" shall mean any work of authorship, including
|
50
|
+
the original version of the Work and any modifications or additions
|
51
|
+
to that Work or Derivative Works thereof, that is intentionally
|
52
|
+
submitted to Licensor for inclusion in the Work by the copyright owner
|
53
|
+
or by an individual or Legal Entity authorized to submit on behalf of
|
54
|
+
the copyright owner. For the purposes of this definition, "submitted"
|
55
|
+
means any form of electronic, verbal, or written communication sent
|
56
|
+
to the Licensor or its representatives, including but not limited to
|
57
|
+
communication on electronic mailing lists, source code control systems,
|
58
|
+
and issue tracking systems that are managed by, or on behalf of, the
|
59
|
+
Licensor for the purpose of discussing and improving the Work, but
|
60
|
+
excluding communication that is conspicuously marked or otherwise
|
61
|
+
designated in writing by the copyright owner as "Not a Contribution."
|
62
|
+
|
63
|
+
"Contributor" shall mean Licensor and any individual or Legal Entity
|
64
|
+
on behalf of whom a Contribution has been received by Licensor and
|
65
|
+
subsequently incorporated within the Work.
|
66
|
+
|
67
|
+
2. Grant of Copyright License. Subject to the terms and conditions of
|
68
|
+
this License, each Contributor hereby grants to You a perpetual,
|
69
|
+
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
|
70
|
+
copyright license to reproduce, prepare Derivative Works of,
|
71
|
+
publicly display, publicly perform, sublicense, and distribute the
|
72
|
+
Work and such Derivative Works in Source or Object form.
|
73
|
+
|
74
|
+
3. Grant of Patent License. Subject to the terms and conditions of
|
75
|
+
this License, each Contributor hereby grants to You a perpetual,
|
76
|
+
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
|
77
|
+
(except as stated in this section) patent license to make, have made,
|
78
|
+
use, offer to sell, sell, import, and otherwise transfer the Work,
|
79
|
+
where such license applies only to those patent claims licensable
|
80
|
+
by such Contributor that are necessarily infringed by their
|
81
|
+
Contribution(s) alone or by combination of their Contribution(s)
|
82
|
+
with the Work to which such Contribution(s) was submitted. If You
|
83
|
+
institute patent litigation against any entity (including a
|
84
|
+
cross-claim or counterclaim in a lawsuit) alleging that the Work
|
85
|
+
or a Contribution incorporated within the Work constitutes direct
|
86
|
+
or contributory patent infringement, then any patent licenses
|
87
|
+
granted to You under this License for that Work shall terminate
|
88
|
+
as of the date such litigation is filed.
|
89
|
+
|
90
|
+
4. Redistribution. You may reproduce and distribute copies of the
|
91
|
+
Work or Derivative Works thereof in any medium, with or without
|
92
|
+
modifications, and in Source or Object form, provided that You
|
93
|
+
meet the following conditions:
|
94
|
+
|
95
|
+
(a) You must give any other recipients of the Work or
|
96
|
+
Derivative Works a copy of this License; and
|
97
|
+
|
98
|
+
(b) You must cause any modified files to carry prominent notices
|
99
|
+
stating that You changed the files; and
|
100
|
+
|
101
|
+
(c) You must retain, in the Source form of any Derivative Works
|
102
|
+
that You distribute, all copyright, patent, trademark, and
|
103
|
+
attribution notices from the Source form of the Work,
|
104
|
+
excluding those notices that do not pertain to any part of
|
105
|
+
the Derivative Works; and
|
106
|
+
|
107
|
+
(d) If the Work includes a "NOTICE" text file as part of its
|
108
|
+
distribution, then any Derivative Works that You distribute must
|
109
|
+
include a readable copy of the attribution notices contained
|
110
|
+
within such NOTICE file, excluding those notices that do not
|
111
|
+
pertain to any part of the Derivative Works, in at least one
|
112
|
+
of the following places: within a NOTICE text file distributed
|
113
|
+
as part of the Derivative Works; within the Source form or
|
114
|
+
documentation, if provided along with the Derivative Works; or,
|
115
|
+
within a display generated by the Derivative Works, if and
|
116
|
+
wherever such third-party notices normally appear. The contents
|
117
|
+
of the NOTICE file are for informational purposes only and
|
118
|
+
do not modify the License. You may add Your own attribution
|
119
|
+
notices within Derivative Works that You distribute, alongside
|
120
|
+
or as an addendum to the NOTICE text from the Work, provided
|
121
|
+
that such additional attribution notices cannot be construed
|
122
|
+
as modifying the License.
|
123
|
+
|
124
|
+
You may add Your own copyright statement to Your modifications and
|
125
|
+
may provide additional or different license terms and conditions
|
126
|
+
for use, reproduction, or distribution of Your modifications, or
|
127
|
+
for any such Derivative Works as a whole, provided Your use,
|
128
|
+
reproduction, and distribution of the Work otherwise complies with
|
129
|
+
the conditions stated in this License.
|
130
|
+
|
131
|
+
5. Submission of Contributions. Unless You explicitly state otherwise,
|
132
|
+
any Contribution intentionally submitted for inclusion in the Work
|
133
|
+
by You to the Licensor shall be under the terms and conditions of
|
134
|
+
this License, without any additional terms or conditions.
|
135
|
+
Notwithstanding the above, nothing herein shall supersede or modify
|
136
|
+
the terms of any separate license agreement you may have executed
|
137
|
+
with Licensor regarding such Contributions.
|
138
|
+
|
139
|
+
6. Trademarks. This License does not grant permission to use the trade
|
140
|
+
names, trademarks, service marks, or product names of the Licensor,
|
141
|
+
except as required for reasonable and customary use in describing the
|
142
|
+
origin of the Work and reproducing the content of the NOTICE file.
|
143
|
+
|
144
|
+
7. Disclaimer of Warranty. Unless required by applicable law or
|
145
|
+
agreed to in writing, Licensor provides the Work (and each
|
146
|
+
Contributor provides its Contributions) on an "AS IS" BASIS,
|
147
|
+
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
|
148
|
+
implied, including, without limitation, any warranties or conditions
|
149
|
+
of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
|
150
|
+
PARTICULAR PURPOSE. You are solely responsible for determining the
|
151
|
+
appropriateness of using or redistributing the Work and assume any
|
152
|
+
risks associated with Your exercise of permissions under this License.
|
153
|
+
|
154
|
+
8. Limitation of Liability. In no event and under no legal theory,
|
155
|
+
whether in tort (including negligence), contract, or otherwise,
|
156
|
+
unless required by applicable law (such as deliberate and grossly
|
157
|
+
negligent acts) or agreed to in writing, shall any Contributor be
|
158
|
+
liable to You for damages, including any direct, indirect, special,
|
159
|
+
incidental, or consequential damages of any character arising as a
|
160
|
+
result of this License or out of the use or inability to use the
|
161
|
+
Work (including but not limited to damages for loss of goodwill,
|
162
|
+
work stoppage, computer failure or malfunction, or any and all
|
163
|
+
other commercial damages or losses), even if such Contributor
|
164
|
+
has been advised of the possibility of such damages.
|
165
|
+
|
166
|
+
9. Accepting Warranty or Additional Liability. While redistributing
|
167
|
+
the Work or Derivative Works thereof, You may choose to offer,
|
168
|
+
and charge a fee for, acceptance of support, warranty, indemnity,
|
169
|
+
or other liability obligations and/or rights consistent with this
|
170
|
+
License. However, in accepting such obligations, You may act only
|
171
|
+
on Your own behalf and on Your sole responsibility, not on behalf
|
172
|
+
of any other Contributor, and only if You agree to indemnify,
|
173
|
+
defend, and hold each Contributor harmless for any liability
|
174
|
+
incurred by, or claims asserted against, such Contributor by reason
|
175
|
+
of your accepting any such warranty or additional liability.
|
176
|
+
|
177
|
+
END OF TERMS AND CONDITIONS
|
178
|
+
|
179
|
+
APPENDIX: How to apply the Apache License to your work.
|
180
|
+
|
181
|
+
To apply the Apache License to your work, attach the following
|
182
|
+
boilerplate notice, with the fields enclosed by brackets "[]"
|
183
|
+
replaced with your own identifying information. (Don't include
|
184
|
+
the brackets!) The text should be enclosed in the appropriate
|
185
|
+
comment syntax for the file format. We also recommend that a
|
186
|
+
file or class name and description of purpose be included on the
|
187
|
+
same "printed page" as the copyright notice for easier
|
188
|
+
identification within third-party archives.
|
189
|
+
|
190
|
+
Copyright [yyyy] [name of copyright owner]
|
191
|
+
|
192
|
+
Licensed under the Apache License, Version 2.0 (the "License");
|
193
|
+
you may not use this file except in compliance with the License.
|
194
|
+
You may obtain a copy of the License at
|
195
|
+
|
196
|
+
http://www.apache.org/licenses/LICENSE-2.0
|
197
|
+
|
198
|
+
Unless required by applicable law or agreed to in writing, software
|
199
|
+
distributed under the License is distributed on an "AS IS" BASIS,
|
200
|
+
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
201
|
+
See the License for the specific language governing permissions and
|
202
|
+
limitations under the License.
|
data/README.md
ADDED
File without changes
|
data/bin/mvp
ADDED
@@ -0,0 +1,121 @@
|
|
1
|
+
#! /usr/bin/env ruby
# CLI entry point: scrapes the Puppet Forge API for module & author stats,
# caches the data locally, and optionally uploads it to BigQuery.

require 'rubygems'
require 'optparse'
require 'yaml'
require 'fileutils'
require 'logger'
require 'mvp'

NAME    = File.basename($PROGRAM_NAME)
options = {:config => File.expand_path('~/.mvp.config.yaml')}
optparse = OptionParser.new { |opts|
  opts.banner = "Usage : #{NAME} [command] [target] [options]

This tool will scrape the Puppet Forge API for interesting module & author stats.
The following CLI commands are available.

  * get | retrieve | download [target]
    * Downloads and caches all Forge metadata.
    * Optional targets: all, authors, modules, releases
  * upload | insert [target]
    * Uploads data to BigQuery
    * Optional targets: all, authors, modules, releases, mirrors
  * stats
    * Print out a summary of interesting stats.
"

  opts.on("-f FORGEAPI", "--forgeapi FORGEAPI", "Forge API server. Rarely needed.") do |arg|
    options[:forgeapi] = arg
  end

  opts.on("-c config", "--config CONFIG", "Location of config.yaml.") do |arg|
    options[:config] = File.expand_path(arg)
  end

  opts.on("-C CACHEDIR", "--cachedir CACHEDIR", "Where data should be cached.") do |arg|
    options[:cachedir] = arg
  end

  opts.on("-g GITHUB_DATA", "--github_data GITHUB_DATA", "The path to a csv file containing GitHub repos & stars.") do |arg|
    options[:github_data] = arg
  end

  # BUG FIX: options[:gcloud] does not exist yet while flags are being
  # parsed, so the original `options[:gcloud][:project] = arg` raised
  # NoMethodError on nil. Create the sub-hash on demand instead.
  opts.on("--project PROJECT", "The gcloud project to use.") do |arg|
    (options[:gcloud] ||= {})[:project] = arg
  end

  opts.on("--dataset DATASET", "The gcloud dataset to use.") do |arg|
    (options[:gcloud] ||= {})[:dataset] = arg
  end

  opts.on("--keyfile KEYFILE", "The gcloud keyfile to use.") do |arg|
    (options[:gcloud] ||= {})[:keyfile] = arg
  end

  opts.on("-o OUTPUT_FILE", "--output_file OUTPUT_FILE", "The path to save a csv report.") do |arg|
    options[:output_file] = arg
  end

  opts.on("-d", "--debug", "Display extra debugging information.") do
    options[:debug] = true
  end

  opts.separator('')

  opts.on("-h", "--help", "Displays this help") do
    puts opts
    exit
  end
}
optparse.parse!

# Config-file values act as defaults; CLI options win. NOTE(review): this is
# a shallow merge, so any gcloud flag on the CLI replaces the config file's
# entire :gcloud hash rather than merging into it.
options = (YAML.load_file(options[:config]) rescue {}).merge(options)

# Fill in hard defaults for anything neither the CLI nor the config set.
options[:cachedir] ||= '~/.mvp/cache'
options[:forgeapi] ||= 'https://forgeapi.puppet.com'
options[:gcloud]   ||= {}
options[:gcloud][:dataset] ||= 'community'
options[:gcloud][:project] ||= 'puppet'
options[:gcloud][:keyfile] ||= '~/.mvp/credentials.json'

options[:cachedir]         = File.expand_path(options[:cachedir])
options[:gcloud][:keyfile] = File.expand_path(options[:gcloud][:keyfile])
FileUtils.mkdir_p(options[:cachedir])

$logger           = Logger::new(STDOUT)
$logger.level     = options[:debug] ? Logger::DEBUG : Logger::INFO
$logger.formatter = proc { |severity,datetime,progname,msg| "#{severity}: #{msg}\n" }

runner = Mvp::Runner.new(options)

command, target = ARGV
case command
when 'get', 'retrieve', 'download'
  target ||= :all
  runner.retrieve(target.to_sym)

when 'transform'
  # Re-flatten already-cached data without hitting the Forge API.
  target ||= :all
  runner.retrieve(target.to_sym, false)

when 'insert', 'upload'
  target ||= :all
  runner.upload(target.to_sym)

when 'mirror'
  target ||= :all
  runner.mirror(target.to_sym)

when 'stats'
  target ||= :all
  runner.stats(target.to_sym)

when 'test'
  runner.test

else
  puts "Unknown command: #{command}"
  puts "Run #{NAME} -h for usage."
  exit 1
end
|
data/lib/mvp/downloader.rb
ADDED
@@ -0,0 +1,199 @@
|
|
1
|
+
require 'json'
|
2
|
+
require 'httparty'
|
3
|
+
require 'tty-spinner'
|
4
|
+
require 'semantic_puppet'
|
5
|
+
require 'mvp/monkeypatches'
|
6
|
+
|
7
|
+
class Mvp
  # Downloads Puppet Forge API data (users, modules, releases, validations),
  # caches it as JSON, and flattens it into newline-delimited JSON suitable
  # for bulk-loading into BigQuery.
  class Downloader
    def initialize(options = {})
      @cachedir = options[:cachedir]
      @forgeapi = options[:forgeapi] || 'https://forgeapi.puppet.com'
    end

    # Retrieve one entity dataset, either from the Forge API (download=true)
    # or from the local JSON cache, then write its flattened NLD-JSON form.
    # Returns the (possibly flattened) data.
    def retrieve(entity, download = true)
      if download
        # I am focusing on authorship rather than just users, so for now I'm
        # using the word authors.
        item = (entity == :authors) ? 'users' : entity.to_s
        data = download(item)
        save_json(entity, data)
      else
        # BUG FIX: the cache holds serialized JSON text; parse it back into
        # Ruby structures so flatten_modules/flatten_releases (which iterate
        # hashes) can work on it. The original passed the raw String through.
        data = JSON.parse(File.read("#{@cachedir}/#{entity}.json"))
      end

      case entity
      when :modules
        data = flatten_modules(data)
      when :releases
        data = flatten_releases(data)
      end
      save_nld_json(entity.to_s, data)
      data
    end

    # Fetch per-module validation scores from the private validations
    # endpoint, one request per module, and cache both raw and flat forms.
    def validations()
      results = {}
      cache   = "#{@cachedir}/modules.json"

      # BUG FIX: the original fell back to `module_data = retrieve(:modules)`,
      # but retrieve's return value was File.write's byte count, not data.
      # Download into the cache first, then read the (unflattened) cache.
      retrieve(:modules) unless File.exist? cache
      module_data = JSON.parse(File.read(cache))

      begin
        offset   = 0
        endpoint = "/private/validations/"
        spinner  = TTY::Spinner.new("[:spinner] :title")
        spinner.update(title: "Downloading module validations ...")
        spinner.auto_spin

        module_data.each do |mod|
          name     = "#{mod['owner']['username']}-#{mod['name']}"
          response = HTTParty.get("#{@forgeapi}#{endpoint}#{name}", headers: {"User-Agent" => "Puppet Community Stats Monitor"})
          # BUG FIX: was `@response.body` — an unset instance variable (nil);
          # report the actual failing response body.
          raise "Forge Error: #{response.body}" unless response.code == 200

          data    = JSON.parse(response.body)
          offset += 1
          results[name] = data

          spinner.update(title: "Downloading module validations [#{offset}]...") if (offset % 25 == 0)
        end

        spinner.success('(OK)')
      rescue => e
        # Best-effort: log and fall through with whatever we collected.
        spinner.error('API error')
        $logger.error e.message
        $logger.debug e.backtrace.join("\n")
      end

      save_json('validations', results)
      save_nld_json('validations', flatten_validations(results))
      results
    end

    # Page through a v3 Forge collection endpoint (50 records at a time,
    # following pagination links) and return all results with dates munged.
    def download(entity)
      results = []

      begin
        offset   = 0
        endpoint = "/v3/#{entity}?sort_by=downloads&limit=50"
        spinner  = TTY::Spinner.new("[:spinner] :title")
        spinner.update(title: "Downloading #{entity} ...")
        spinner.auto_spin

        while endpoint do
          response = HTTParty.get("#{@forgeapi}#{endpoint}", headers: {"User-Agent" => "Puppet Community Stats Monitor"})
          # BUG FIX: was `@response.body` — an unset instance variable (nil);
          # report the actual failing response body.
          raise "Forge Error: #{response.body}" unless response.code == 200

          data     = JSON.parse(response.body)
          offset  += 50
          results += data['results']
          endpoint = data['pagination']['next']  # nil on the last page

          spinner.update(title: "Downloading #{entity} [#{offset}]...") if (endpoint and (offset % 250 == 0))
        end

        spinner.success('(OK)')
      rescue => e
        # Best-effort: log and fall through with whatever we collected.
        spinner.error('API error')
        $logger.error e.message
        $logger.debug e.backtrace.join("\n")
      end

      munge_dates(results)
    end

    # Transform dates into a format that BigQuery knows.
    def munge_dates(object)
      # Guard: `.first` is nil for an empty result set (e.g. failed download).
      return object if object.empty?

      ["created_at", "updated_at", "deprecated_at", "deleted_at"].each do |field|
        next unless object.first.keys.include? field

        object.each do |record|
          next unless record[field]
          record[field] = DateTime.parse(record[field]).strftime("%Y-%m-%d %H:%M:%S")
        end
      end
      object
    end

    # Cache raw data as a single JSON document.
    def save_json(thing, data)
      File.write("#{@cachedir}/#{thing}.json", data.to_json)
    end

    # Store data in a way that bigquery can grok:
    # uploading files is far easier than streaming data, when replacing a dataset.
    def save_nld_json(thing, data)
      File.write("#{@cachedir}/nld_#{thing}.json", data.to_newline_delimited_json)
    end

    # Collapse nested module records into flat rows for BigQuery.
    # Mutates and returns `data`.
    def flatten_modules(data)
      data.each do |row|
        row['owner']            = row['owner']['username']
        row['superseded_by']    = row['superseded_by']['slug'] rescue nil
        row['pdk']              = row['current_release']['pdk']
        row['supported']        = row['current_release']['supported']
        row['version']          = row['current_release']['version']
        row['validation_score'] = row['current_release']['validation_score']
        row['license']          = row['current_release']['metadata']['license']
        row['source']           = row['current_release']['metadata']['source']
        row['project_page']     = row['current_release']['metadata']['project_page']
        row['issues_url']       = row['current_release']['metadata']['issues_url']
        row['tasks']            = row['current_release']['tasks'].map{|task| task['name']}

        row['release_count']    = row['releases'].count rescue 0
        row['releases']         = row['releases'].map{|r| r['version']} rescue []

        simplify_metadata(row, row['current_release']['metadata'])
        row.delete('current_release')
      end
      data
    end

    # Collapse nested release records into flat rows for BigQuery.
    # Mutates and returns `data`.
    def flatten_releases(data)
      data.each do |row|
        row['name']         = row['module']['name']
        # NOTE(review): elsewhere in this gem the owner is read as
        # module.owner.username — verify this shouldn't be
        # row['module']['owner']['username'].
        row['owner']        = row['module']['username']
        row['license']      = row['metadata']['license']
        row['source']       = row['metadata']['source']
        row['project_page'] = row['metadata']['project_page']
        row['issues_url']   = row['metadata']['issues_url']
        row['tasks']        = row['tasks'].map{|task| task['name']}

        simplify_metadata(row, row['metadata'])
        row.delete('module')
      end
      data
    end

    # Turn the {name => [{name, score}, ...]} validations map into flat rows.
    def flatten_validations(data)
      data.map do |name, scores|
        row = { 'name' => name }
        scores.each do |entry|
          row[entry['name']] = entry['score']
        end
        row
      end
    end

    # Extract commonly-queried fields out of a module/release metadata blob,
    # including per-major-version Puppet compatibility booleans.
    def simplify_metadata(data, metadata)
      data['operatingsystem'] = metadata['operatingsystem_support'].map{|i| i['operatingsystem']} rescue nil
      data['dependencies']    = metadata['dependencies'].map{|i| i['name']} rescue nil
      data['puppet_range']    = metadata['requirements'].select{|r| r['name'] == 'puppet'}.first['version_requirement'] rescue nil
      data['metadata']        = metadata.to_json

      if data['puppet_range']
        range = SemanticPuppet::VersionRange.parse(data['puppet_range'])
        data['puppet_2x'] = range.include? SemanticPuppet::Version.parse('2.99.99')
        data['puppet_3x'] = range.include? SemanticPuppet::Version.parse('3.99.99')
        data['puppet_4x'] = range.include? SemanticPuppet::Version.parse('4.99.99')
        data['puppet_5x'] = range.include? SemanticPuppet::Version.parse('5.99.99')
        data['puppet_6x'] = range.include? SemanticPuppet::Version.parse('6.99.99')
      end
    end

    # Drop into an interactive console for debugging.
    def test()
      require 'pry'
      binding.pry
    end
  end
end
|
data/lib/mvp/runner.rb
ADDED
@@ -0,0 +1,54 @@
|
|
1
|
+
require 'mvp/downloader'
|
2
|
+
require 'mvp/uploader'
|
3
|
+
require 'mvp/stats'
|
4
|
+
|
5
|
+
class Mvp
  # Orchestrates the download / upload / stats workflows selected on the CLI.
  class Runner
    def initialize(options = {})
      @cachedir = options[:cachedir]
      @debug    = options[:debug]
      @options  = options
    end

    # Download Forge metadata for the chosen target(s); with download=false,
    # re-transform cached data instead of hitting the API.
    def retrieve(target = :all, download = true)
      downloader = Mvp::Downloader.new(@options)

      [:authors, :modules, :releases].each do |thing|
        next unless [:all, thing].include? target
        downloader.retrieve(thing, download)
      end

      if [:all, :validations].include? target
        downloader.validations()
      end
    end

    # Push cached datasets to BigQuery.
    def upload(target = :all)
      uploader = Mvp::Uploader.new(@options)

      [:authors, :modules, :releases, :validations, :mirrors].each do |thing|
        next unless [:all, thing].include? target
        uploader.send(thing)
      end
    end

    # Convenience: retrieve then upload in one pass.
    def mirror(target = :all)
      retrieve(target)
      upload(target)
    end

    # Print summary statistics for the chosen target(s).
    # CONSISTENCY FIX: default target to :all like every other command here
    # (the original required callers to pass it explicitly).
    def stats(target = :all)
      stats = Mvp::Stats.new(@options)

      [:authors, :modules, :releases, :relationships, :github, :validations].each do |thing|
        next unless [:all, thing].include? target
        stats.send(thing)
      end
    end

    # Drop into an interactive console for debugging.
    def test()
      require 'pry'
      binding.pry
    end
  end
end
|
data/lib/mvp/stats.rb
ADDED
@@ -0,0 +1,339 @@
|
|
1
|
+
require 'json'
|
2
|
+
require 'histogram'
|
3
|
+
require 'ascii_charts'
|
4
|
+
require 'histogram/array'
|
5
|
+
require 'sparkr'
|
6
|
+
|
7
|
+
class Mvp
|
8
|
+
class Stats
|
9
|
+
# Capture run-wide settings: cache location, optional GitHub star CSV,
# and the CSV report destination.
def initialize(options = {})
  @cachedir, @github_data, @output_file = options.values_at(:cachedir, :github_data, :output_file)
  # Snapshot today's date once so all age calculations in a run agree.
  @today = Date.today
end
|
15
|
+
|
16
|
+
# Read a cached Forge dump back into Ruby objects.
# (Intentionally shadows Kernel#load within this class.)
def load(entity)
  cached = File.read("#{@cachedir}/#{entity}.json")
  JSON.parse(cached)
end
|
19
|
+
|
20
|
+
# Print an ASCII histogram of `series` bucketed into bins of `width`.
# Nils are stripped from `series` in place before bucketing.
def draw_graph(series, width, title = nil)
  series.compact!
  bins, freqs = series.histogram(:bin_width => width)
  # Pair each bin with its frequency to build the chart's point list.
  points = bins.zip(freqs)
  puts AsciiCharts::Cartesian.new(points, :bar => true, :hide_zero => true, :title => title).draw
end
|
30
|
+
|
31
|
+
# TODO: improve this to discard outliers and slightly weight larger series
|
32
|
+
# TODO: improve this to discard outliers and slightly weight larger series
# Mean of a numeric series, ignoring nils; returns 0 for an empty or
# all-nil series.
# BUG FIX: uses non-destructive compact — the original's compact! mutated
# the caller's array as a side effect.
def average(series)
  values = series.compact
  return 0 if values.empty?

  values.inject(0.0) { |sum, el| sum + el } / values.size
end
|
38
|
+
|
39
|
+
# Days elapsed between the given date string and the run date (@today).
def days_ago(datestr)
  parsed = Date.parse(datestr)
  @today - parsed
end
|
42
|
+
|
43
|
+
# Years elapsed between the given date string and the run date.
# Approximate: a flat 365-day year, ignoring leap days.
def years_ago(datestr)
  days_ago(datestr) / 365
end
|
46
|
+
|
47
|
+
# Fold per-release data into the author records in `target`: collects each
# author's release dates and validation scores, then derives their average
# score, 'impact' (average * author[scope]), and newest/oldest release.
# Mutates the hashes in `target`.
def tally_author_info(releases, target, scope='module_count')
  # update the author records with the fields we need
  target.each do |author|
    author['release_dates'] = []
    author['scores'] = []
  end

  releases.each do |mod|
    username = mod['module']['owner']['username']
    score    = mod['validation_score']
    author   = target.find {|m| m['username'] == username}
    # BUG FIX: a release can belong to an author filtered out of `target`
    # (e.g. puppetlabs, or zero-module authors); skip it instead of
    # crashing on a nil author.
    next unless author

    author['release_dates'] << mod['created_at']
    author['scores'] << score if score
  end

  target.each do |author|
    author['average'] = average(author['scores']).to_i
    author['impact']  = author['average'] * author[scope]
    # max_by/min_by return nil for authors with no releases; callers filter.
    author['newest_release'] = author['release_dates'].max_by {|r| Date.parse(r) }
    author['oldest_release'] = author['release_dates'].min_by {|r| Date.parse(r) }
  end
end
|
70
|
+
|
71
|
+
# Print author-centric statistics: cohort histograms, publication counts,
# and top-20 tables by module count and by release count.
def authors()
  # The puppetlabs account dwarfs everyone else; exclude it from all stats.
  data = load('authors').reject {|u| u['username'] == 'puppetlabs' }
  # Cohorts: casual = 2..9 modules, prolific = 10+.
  casual = data.select {|u| (2...10).include? u['module_count'] }
  prolific = data.select {|u| u['module_count'] > 9}
  # Top-20 lists by module count and by release count.
  topmost = data.sort_by {|u| u['module_count']}.reverse[0...20]
  releases = data.sort_by {|u| u['release_count']}.reverse[0...20]

  puts "* Prolific in this case is more than 9 released modules."

  draw_graph(casual.map {|u| u['module_count']}, 1, 'Number of modules from casual authors')
  draw_graph(prolific.map {|u| u['module_count']}, 5, 'Number of modules from prolific authors')

  puts
  puts
  puts "Author Statistics:"
  puts " └── Number of users: #{data.count}"
  puts " └── Number who have never published a module: #{data.select {|u| u['module_count'] == 0}.count}"
  puts " └── Number who have published a single module: #{data.select {|u| u['module_count'] == 1}.count}"
  puts " └── Number who have published multiple modules: #{data.select {|u| u['module_count'] > 1}.count}"
  puts " └── Number who have published two modules: #{data.select {|u| u['module_count'] == 2}.count}"
  puts " └── Number who have published more than 5 modules: #{data.select {|u| u['module_count'] > 5}.count}"
  puts " └── Number who have published more than 10 modules: #{data.select {|u| u['module_count'] > 10}.count}"
  puts " └── Number who have published more than 20 modules: #{data.select {|u| u['module_count'] > 20}.count}"
  puts " └── Number who have published more than 30 modules: #{data.select {|u| u['module_count'] > 30}.count}"
  puts " └── Number who have published more than 50 modules: #{data.select {|u| u['module_count'] > 50}.count}"

  puts
  puts "Top 20 prolific module authors by number of modules | number of releases:"
  topmost.each do |author|
    puts " └── %-55s: %d | %d" % [ "#{author['display_name']} (#{author['username']})",
                                   author['module_count'],
                                   author['release_count'] ]
  end
  puts
  puts "Top 20 active module authors by number of releases | number of modules:"
  releases.each do |author|
    puts " └── %-55s: %d | %d" % [ "#{author['display_name']} (#{author['username']})",
                                   author['release_count'],
                                   author['module_count'] ]
  end
end
|
112
|
+
|
113
|
+
def modules()
|
114
|
+
data_m = load('modules').reject {|m| m['owner']['username'] == 'puppetlabs' }
|
115
|
+
data_a = load('authors').reject {|u| u['username'] == 'puppetlabs' or u['module_count'] == 0}
|
116
|
+
current = data_m.map {|m| m['current_release'] }
|
117
|
+
|
118
|
+
tally_author_info(current, data_a, 'module_count')
|
119
|
+
|
120
|
+
prolific = data_a.select{|a| a['impact']>1000}.sort_by {|a| a['impact']}
|
121
|
+
topmost = data_a.sort_by {|a| a['impact']}.reverse[0...20]
|
122
|
+
published = data_a.reject {|u| u['newest_release'].nil?}
|
123
|
+
|
124
|
+
puts '* Validation score is a Forge ranking based on the scores of an individual module release.'
|
125
|
+
puts "* I am defining impact as an author's average validation * the number of modules releases they've made / 100."
|
126
|
+
puts "* Prolific in this case is impact > 100."
|
127
|
+
|
128
|
+
draw_graph(current.map {|m| years_ago(m['created_at']).round(1)}, 0.5, 'Age (in years) distribution by module')
|
129
|
+
draw_graph(published.map {|m| years_ago(m['newest_release']).round(1)}, 0.5, "Distribution of author's newest module by years old")
|
130
|
+
draw_graph(current.map {|m| m['validation_score']}, 10, 'Validation score distribution by module')
|
131
|
+
draw_graph(data_a.map {|a| average(a['scores']).to_i }, 10, 'Validation score distribution by author')
|
132
|
+
draw_graph(prolific.map {|a| a['impact']/100 }, 5, 'Impact distribution by prolific authors')
|
133
|
+
|
134
|
+
puts
|
135
|
+
puts
|
136
|
+
puts "Module Statistics:"
|
137
|
+
puts " └── Number of modules: #{data_m.count}"
|
138
|
+
puts " └── Modules less than a year old: #{current.select {|m| days_ago(m['created_at']) < 365}.count}"
|
139
|
+
puts " └── Modules more than a year old: #{current.select {|m| days_ago(m['created_at']) > 365}.count}"
|
140
|
+
puts " └── Modules more than two years old: #{current.select {|m| years_ago(m['created_at']) > 2}.count}"
|
141
|
+
puts " └── Modules more than three years old: #{current.select {|m| years_ago(m['created_at']) > 3}.count}"
|
142
|
+
puts " └── Modules more than four years old: #{current.select {|m| years_ago(m['created_at']) > 4}.count}"
|
143
|
+
puts " └── Modules more than five years old: #{current.select {|m| years_ago(m['created_at']) > 5}.count}"
|
144
|
+
puts " └── Authors with 'perfect' validation scores: #{data_a.select {|u| average(u['scores']).to_i == 100}.count}"
|
145
|
+
puts " └── Authors who've released in the last year: #{published.select {|u| days_ago(u['newest_release']) < 365}.count}"
|
146
|
+
puts " └── Authors with no outdated (1yr) modules: #{published.select {|u| days_ago(u['oldest_release']) < 365}.count}"
|
147
|
+
|
148
|
+
puts
|
149
|
+
puts "Top 20 high impact module authors by impact | number of modules:"
|
150
|
+
topmost.each do |author|
|
151
|
+
puts " └── %-55s: %d | %d" % [ "#{author['display_name']} (#{author['username']})",
|
152
|
+
author['impact']/100,
|
153
|
+
author['module_count'] ]
|
154
|
+
end
|
155
|
+
end
|
156
|
+
|
157
|
+
def releases()
|
158
|
+
data_r = load('releases').reject {|m| m['module']['owner']['username'] == 'puppetlabs' }
|
159
|
+
data_a = load('authors').reject {|u| u['username'] == 'puppetlabs' or u['module_count'] == 0}
|
160
|
+
|
161
|
+
tally_author_info(data_r, data_a, 'release_count')
|
162
|
+
|
163
|
+
impactful = data_a.select{|a| a['impact']>5000}.sort_by {|a| a['impact']}
|
164
|
+
topmost = data_a.sort_by {|a| a['impact']}.reverse[0...20]
|
165
|
+
published = data_a.reject {|u| u['newest_release'].nil?}
|
166
|
+
multiple = published.select {|u| u['module_count'] > 1}
|
167
|
+
prolific = published.select {|u| u['module_count'] > 9}
|
168
|
+
current = multiple.sort_by {|a| days_ago(a['oldest_release'])}[0...20]
|
169
|
+
|
170
|
+
# Authors that used to be active, but don't seem to be any more
|
171
|
+
faded = published.select do |author|
|
172
|
+
count_old = author['release_dates'].select {|r| years_ago(r) > 2 }.count
|
173
|
+
count_new = author['release_dates'].select {|r| years_ago(r) < 1.5 }.count
|
174
|
+
|
175
|
+
(count_old > 25 and count_old > (50*count_new))
|
176
|
+
end
|
177
|
+
|
178
|
+
oldest = years_ago(faded.map { |u| u['release_dates']}.flatten.max_by {|r| days_ago(r) }).to_i
|
179
|
+
faded.each do |author|
|
180
|
+
author['annual_releases'] = []
|
181
|
+
|
182
|
+
(1..oldest).each do |age|
|
183
|
+
author['annual_releases'] << author['release_dates'].select {|r| years_ago(r).to_i == age }.count
|
184
|
+
end
|
185
|
+
author['annual_releases'].reverse!
|
186
|
+
end
|
187
|
+
|
188
|
+
puts '* Validation score is a Forge ranking based on the scores of an individual module release.'
|
189
|
+
puts "* I am defining impact as an author's average validation * the number of modules releases they've made / 100."
|
190
|
+
puts "* Prolific in this case is more than 9 released modules."
|
191
|
+
|
192
|
+
draw_graph(data_a.map {|a| average(a['scores']).to_i }, 10, 'Validation score distribution by author')
|
193
|
+
draw_graph(impactful.map {|a| a['impact']/100 }, 50, 'Impact distribution by impactful authors')
|
194
|
+
|
195
|
+
puts
|
196
|
+
puts
|
197
|
+
puts "Release Statistics:"
|
198
|
+
puts " └── Number of releases: #{data_r.count}"
|
199
|
+
puts " └── Authors with no releases: #{data_a.count - published.count}"
|
200
|
+
puts " └── Authors with only a single releases: #{published.count - multiple.count}"
|
201
|
+
puts " └── Authors with no releases in one year: #{published.select {|m| years_ago(m['newest_release']) >1}.count}"
|
202
|
+
puts " └── Authors with no releases in two years: #{published.select {|m| years_ago(m['newest_release']) >2}.count}"
|
203
|
+
puts " └── Authors with no releases in three years: #{published.select {|m| years_ago(m['newest_release']) >3}.count}"
|
204
|
+
puts " └── Authors with no releases in four years: #{published.select {|m| years_ago(m['newest_release']) >4}.count}"
|
205
|
+
puts " └── Authors with no releases in five years: #{published.select {|m| years_ago(m['newest_release']) >5}.count}"
|
206
|
+
puts " └── Authors with multiple releases, all newer than a month: #{multiple.select {|u| days_ago(u['oldest_release']) < 30}.count}"
|
207
|
+
puts " └── Authors with multiple releases, all newer than 3 months: #{multiple.select {|u| days_ago(u['oldest_release']) < 90}.count}"
|
208
|
+
puts " └── Authors with multiple releases, all newer than 6 months: #{multiple.select {|u| days_ago(u['oldest_release']) < 180}.count}"
|
209
|
+
puts " └── Authors with multiple releases, all newer than a year: #{multiple.select {|u| days_ago(u['oldest_release']) < 365}.count}"
|
210
|
+
puts " └── Prolific authors, with releases all newer than 3 months: #{prolific.select {|u| days_ago(u['oldest_release']) < 90}.count}"
|
211
|
+
puts " └── Prolific authors, with releases all newer than 6 months: #{prolific.select {|u| days_ago(u['oldest_release']) < 180}.count}"
|
212
|
+
puts " └── Prolific authors, with releases all newer than a year: #{prolific.select {|u| days_ago(u['oldest_release']) < 365}.count}"
|
213
|
+
puts " └── Prolific authors, with releases all newer than 2 years: #{prolific.select {|u| years_ago(u['oldest_release']) < 2}.count}"
|
214
|
+
|
215
|
+
puts
|
216
|
+
puts "Top 20 high impact module authors by impact | number of releases:"
|
217
|
+
topmost.each do |author|
|
218
|
+
puts " └── %-55s: %d | %d" % [ "#{author['display_name']} (#{author['username']})",
|
219
|
+
author['impact']/100,
|
220
|
+
author['release_count'] ]
|
221
|
+
end
|
222
|
+
puts
|
223
|
+
puts "Top 20 current module authors by oldest release | number of releases:"
|
224
|
+
current.each do |author|
|
225
|
+
puts " └── %-55s: %s | %d" % [ "#{author['display_name']} (#{author['username']})",
|
226
|
+
Date.parse(author['oldest_release']).strftime('%v'),
|
227
|
+
author['release_count'] ]
|
228
|
+
end
|
229
|
+
puts
|
230
|
+
puts "Authors who are no longer as active as they used to be:"
|
231
|
+
faded.each do |author|
|
232
|
+
puts " └── %-55s: %s %s" % [ "#{author['display_name']} (#{author['username']})",
|
233
|
+
Sparkr.sparkline(author['annual_releases']),
|
234
|
+
author['annual_releases'].to_s ]
|
235
|
+
end
|
236
|
+
end
|
237
|
+
|
238
|
+
def relationships()
|
239
|
+
data_m = load('modules').reject {|m| m['owner']['username'] == 'puppetlabs' }
|
240
|
+
data_a = load('authors').reject {|u| u['username'] == 'puppetlabs' or u['module_count'] == 0}
|
241
|
+
current = data_m.map {|m| m['current_release'] }
|
242
|
+
|
243
|
+
current.each do |mod|
|
244
|
+
mod['metadata']['dependants'] = []
|
245
|
+
end
|
246
|
+
current.each do |mod|
|
247
|
+
mod['metadata']['dependencies'].each do |dependency|
|
248
|
+
target = current.select {|m| m['metadata']['name'] == dependency['name'].sub('/','-')}.first
|
249
|
+
next unless target
|
250
|
+
|
251
|
+
target['metadata']['dependants'] << mod['metadata']['name']
|
252
|
+
end
|
253
|
+
end
|
254
|
+
|
255
|
+
data_a.each { |a| a['dependants'] = [] }
|
256
|
+
current.each do |mod|
|
257
|
+
count = mod['metadata']['dependants'].count
|
258
|
+
next unless count > 0
|
259
|
+
|
260
|
+
author = data_a.select{|m| m['username'] == mod['module']['owner']['username']}.first
|
261
|
+
author['dependants'] << count
|
262
|
+
end
|
263
|
+
data_a.each { |a| a['average_dependants'] = average(a['dependants']) }
|
264
|
+
|
265
|
+
top_mods = current.sort_by {|m| m['metadata']['dependants'].count}.reverse[0...20]
|
266
|
+
connected = data_a.sort_by {|a| a['average_dependants'] }.reverse[0...20]
|
267
|
+
|
268
|
+
low_conn = current.select {|m| (2..10).include? m['metadata']['dependants'].count}
|
269
|
+
high_conn = current.select {|m| m['metadata']['dependants'].count > 10}
|
270
|
+
|
271
|
+
draw_graph(low_conn.map {|m| m['metadata']['dependants'].count }, 1, 'Number of dependent modules for low connection modules')
|
272
|
+
draw_graph(high_conn.map {|m| m['metadata']['dependants'].count }, 10, 'Number of dependent modules for high connection modules')
|
273
|
+
draw_graph(connected.map {|a| a['average_dependants'].to_i }, 5, 'Average number of dependent modules by author')
|
274
|
+
|
275
|
+
puts
|
276
|
+
puts "Top 20 connected module authors by number of dependants | number of modules | number of releases:"
|
277
|
+
connected.each do |author|
|
278
|
+
puts " └── %-55s: %s | %d | %d" % [ "#{author['display_name']} (#{author['username']})",
|
279
|
+
author['average_dependants'].to_i,
|
280
|
+
author['module_count'],
|
281
|
+
author['release_count'] ]
|
282
|
+
end
|
283
|
+
end
|
284
|
+
|
285
|
+
def github()
|
286
|
+
require 'csv'
|
287
|
+
require 'net/http'
|
288
|
+
raise "Need to provide a data file to gather GitHub stats!" unless @github_data
|
289
|
+
|
290
|
+
unfound = []
|
291
|
+
modules = load('modules').map {|m| m['slug']}
|
292
|
+
CSV.foreach(@github_data) do |row|
|
293
|
+
repo, stars = row
|
294
|
+
next unless repo =~ /^\w+\/\w+$/
|
295
|
+
|
296
|
+
begin
|
297
|
+
uri_path = "https://raw.githubusercontent.com/#{repo}/master/metadata.json"
|
298
|
+
metadata = JSON.parse(Net::HTTP.get(URI.parse(uri_path)))
|
299
|
+
|
300
|
+
unless modules.include? metadata['name'].sub('/', '-')
|
301
|
+
repo_path = "https://github.com/#{repo}"
|
302
|
+
unfound << { :repo => repo_path, :stars => stars}
|
303
|
+
end
|
304
|
+
rescue => e
|
305
|
+
puts "#{e.class} for #{uri_path}"
|
306
|
+
end
|
307
|
+
end
|
308
|
+
|
309
|
+
# sort the list by number of stars, descending then alphabatize by repo
|
310
|
+
unfound.sort! do |a, b|
|
311
|
+
[b[:stars], a[:repo]] <=> [a[:stars], b[:repo]]
|
312
|
+
end
|
313
|
+
|
314
|
+
if @output_file
|
315
|
+
CSV.open("outreach.csv", "w+") do |csv|
|
316
|
+
unfound.each do |mod|
|
317
|
+
csv << [ mod[:repo], mod[:stars] ]
|
318
|
+
end
|
319
|
+
end
|
320
|
+
end
|
321
|
+
|
322
|
+
puts "The following #{unfound.count} module repositories were not represented on the Forge:" unless unfound.empty?
|
323
|
+
unfound.each do |mod|
|
324
|
+
puts " └── %-65s: %d" % [ mod[:repo], mod[:stars] ]
|
325
|
+
end
|
326
|
+
|
327
|
+
|
328
|
+
end
|
329
|
+
|
330
|
+
def validations()
|
331
|
+
puts 'got nothing for you yet'
|
332
|
+
end
|
333
|
+
|
334
|
+
def test()
|
335
|
+
require 'pry'
|
336
|
+
binding.pry
|
337
|
+
end
|
338
|
+
end
|
339
|
+
end
|
data/lib/mvp/uploader.rb
ADDED
@@ -0,0 +1,100 @@
|
|
1
|
+
require 'json'
|
2
|
+
require 'tty-spinner'
|
3
|
+
require "google/cloud/bigquery"
|
4
|
+
|
5
|
+
class Mvp
|
6
|
+
class Uploader
|
7
|
+
def initialize(options = {})
|
8
|
+
@cachedir = options[:cachedir]
|
9
|
+
@mirrors = options[:gcloud][:mirrors]
|
10
|
+
@bigquery = Google::Cloud::Bigquery.new(
|
11
|
+
:project_id => options[:gcloud][:project],
|
12
|
+
:credentials => Google::Cloud::Bigquery::Credentials.new(options[:gcloud][:keyfile]),
|
13
|
+
)
|
14
|
+
@dataset = @bigquery.dataset(options[:gcloud][:dataset])
|
15
|
+
end
|
16
|
+
|
17
|
+
def authors()
|
18
|
+
upload('authors')
|
19
|
+
end
|
20
|
+
|
21
|
+
def modules()
|
22
|
+
upload('modules')
|
23
|
+
end
|
24
|
+
|
25
|
+
def releases()
|
26
|
+
upload('releases')
|
27
|
+
end
|
28
|
+
|
29
|
+
def validations()
|
30
|
+
upload('validations')
|
31
|
+
end
|
32
|
+
|
33
|
+
def mirrors()
|
34
|
+
@mirrors.each do |entity|
|
35
|
+
begin
|
36
|
+
spinner = TTY::Spinner.new("[:spinner] :title")
|
37
|
+
spinner.update(title: "Mirroring #{entity[:type]} #{entity[:name]} to BigQuery...")
|
38
|
+
spinner.auto_spin
|
39
|
+
|
40
|
+
case entity[:type]
|
41
|
+
when :view
|
42
|
+
@dataset.table(entity[:name]).delete rescue nil # delete if exists
|
43
|
+
@dataset.create_view(entity[:name], entity[:query],
|
44
|
+
:legacy_sql => true)
|
45
|
+
|
46
|
+
when :table
|
47
|
+
job = @dataset.query_job(entity[:query],
|
48
|
+
:legacy_sql => true,
|
49
|
+
:write => 'truncate',
|
50
|
+
:table => @dataset.table(entity[:name], :skip_lookup => true))
|
51
|
+
job.wait_until_done!
|
52
|
+
|
53
|
+
else
|
54
|
+
$logger.error "Unknown mirror type: #{entity[:type]}"
|
55
|
+
end
|
56
|
+
|
57
|
+
spinner.success('(OK)')
|
58
|
+
rescue => e
|
59
|
+
spinner.error("(Google Cloud error: #{e.message})")
|
60
|
+
$logger.error e.backtrace.join("\n")
|
61
|
+
end
|
62
|
+
end
|
63
|
+
end
|
64
|
+
|
65
|
+
def upload(entity)
|
66
|
+
begin
|
67
|
+
spinner = TTY::Spinner.new("[:spinner] :title")
|
68
|
+
spinner.update(title: "Uploading #{entity} to BigQuery ...")
|
69
|
+
spinner.auto_spin
|
70
|
+
|
71
|
+
@dataset.load("forge_#{entity}", "#{@cachedir}/nld_#{entity}.json",
|
72
|
+
:write => 'truncate',
|
73
|
+
:autodetect => true)
|
74
|
+
|
75
|
+
# table = @dataset.table("forge_#{entity}")
|
76
|
+
# File.readlines("#{@cachedir}/nld_#{entity}.json").each do |line|
|
77
|
+
# data = JSON.parse(line)
|
78
|
+
#
|
79
|
+
# begin
|
80
|
+
# table.insert data
|
81
|
+
# rescue
|
82
|
+
# require 'pry'
|
83
|
+
# binding.pry
|
84
|
+
# end
|
85
|
+
# end
|
86
|
+
|
87
|
+
|
88
|
+
spinner.success('(OK)')
|
89
|
+
rescue => e
|
90
|
+
spinner.error("(Google Cloud error: #{e.message})")
|
91
|
+
$logger.error e.backtrace.join("\n")
|
92
|
+
end
|
93
|
+
end
|
94
|
+
|
95
|
+
def test()
|
96
|
+
require 'pry'
|
97
|
+
binding.pry
|
98
|
+
end
|
99
|
+
end
|
100
|
+
end
|
metadata
ADDED
@@ -0,0 +1,170 @@
|
|
1
|
+
--- !ruby/object:Gem::Specification
|
2
|
+
name: puppet-community-mvp
|
3
|
+
version: !ruby/object:Gem::Version
|
4
|
+
version: 0.0.1
|
5
|
+
platform: ruby
|
6
|
+
authors:
|
7
|
+
- Ben Ford
|
8
|
+
autorequire:
|
9
|
+
bindir: bin
|
10
|
+
cert_chain: []
|
11
|
+
date: 2018-06-27 00:00:00.000000000 Z
|
12
|
+
dependencies:
|
13
|
+
- !ruby/object:Gem::Dependency
|
14
|
+
name: json
|
15
|
+
requirement: !ruby/object:Gem::Requirement
|
16
|
+
requirements:
|
17
|
+
- - ">="
|
18
|
+
- !ruby/object:Gem::Version
|
19
|
+
version: '0'
|
20
|
+
type: :runtime
|
21
|
+
prerelease: false
|
22
|
+
version_requirements: !ruby/object:Gem::Requirement
|
23
|
+
requirements:
|
24
|
+
- - ">="
|
25
|
+
- !ruby/object:Gem::Version
|
26
|
+
version: '0'
|
27
|
+
- !ruby/object:Gem::Dependency
|
28
|
+
name: histogram
|
29
|
+
requirement: !ruby/object:Gem::Requirement
|
30
|
+
requirements:
|
31
|
+
- - ">="
|
32
|
+
- !ruby/object:Gem::Version
|
33
|
+
version: '0'
|
34
|
+
type: :runtime
|
35
|
+
prerelease: false
|
36
|
+
version_requirements: !ruby/object:Gem::Requirement
|
37
|
+
requirements:
|
38
|
+
- - ">="
|
39
|
+
- !ruby/object:Gem::Version
|
40
|
+
version: '0'
|
41
|
+
- !ruby/object:Gem::Dependency
|
42
|
+
name: ascii_charts
|
43
|
+
requirement: !ruby/object:Gem::Requirement
|
44
|
+
requirements:
|
45
|
+
- - ">="
|
46
|
+
- !ruby/object:Gem::Version
|
47
|
+
version: '0'
|
48
|
+
type: :runtime
|
49
|
+
prerelease: false
|
50
|
+
version_requirements: !ruby/object:Gem::Requirement
|
51
|
+
requirements:
|
52
|
+
- - ">="
|
53
|
+
- !ruby/object:Gem::Version
|
54
|
+
version: '0'
|
55
|
+
- !ruby/object:Gem::Dependency
|
56
|
+
name: sparkr
|
57
|
+
requirement: !ruby/object:Gem::Requirement
|
58
|
+
requirements:
|
59
|
+
- - ">="
|
60
|
+
- !ruby/object:Gem::Version
|
61
|
+
version: '0'
|
62
|
+
type: :runtime
|
63
|
+
prerelease: false
|
64
|
+
version_requirements: !ruby/object:Gem::Requirement
|
65
|
+
requirements:
|
66
|
+
- - ">="
|
67
|
+
- !ruby/object:Gem::Version
|
68
|
+
version: '0'
|
69
|
+
- !ruby/object:Gem::Dependency
|
70
|
+
name: semantic_puppet
|
71
|
+
requirement: !ruby/object:Gem::Requirement
|
72
|
+
requirements:
|
73
|
+
- - ">="
|
74
|
+
- !ruby/object:Gem::Version
|
75
|
+
version: '0'
|
76
|
+
type: :runtime
|
77
|
+
prerelease: false
|
78
|
+
version_requirements: !ruby/object:Gem::Requirement
|
79
|
+
requirements:
|
80
|
+
- - ">="
|
81
|
+
- !ruby/object:Gem::Version
|
82
|
+
version: '0'
|
83
|
+
- !ruby/object:Gem::Dependency
|
84
|
+
name: httparty
|
85
|
+
requirement: !ruby/object:Gem::Requirement
|
86
|
+
requirements:
|
87
|
+
- - ">="
|
88
|
+
- !ruby/object:Gem::Version
|
89
|
+
version: '0'
|
90
|
+
type: :runtime
|
91
|
+
prerelease: false
|
92
|
+
version_requirements: !ruby/object:Gem::Requirement
|
93
|
+
requirements:
|
94
|
+
- - ">="
|
95
|
+
- !ruby/object:Gem::Version
|
96
|
+
version: '0'
|
97
|
+
- !ruby/object:Gem::Dependency
|
98
|
+
name: tty-spinner
|
99
|
+
requirement: !ruby/object:Gem::Requirement
|
100
|
+
requirements:
|
101
|
+
- - ">="
|
102
|
+
- !ruby/object:Gem::Version
|
103
|
+
version: '0'
|
104
|
+
type: :runtime
|
105
|
+
prerelease: false
|
106
|
+
version_requirements: !ruby/object:Gem::Requirement
|
107
|
+
requirements:
|
108
|
+
- - ">="
|
109
|
+
- !ruby/object:Gem::Version
|
110
|
+
version: '0'
|
111
|
+
- !ruby/object:Gem::Dependency
|
112
|
+
name: google-cloud
|
113
|
+
requirement: !ruby/object:Gem::Requirement
|
114
|
+
requirements:
|
115
|
+
- - ">="
|
116
|
+
- !ruby/object:Gem::Version
|
117
|
+
version: '0'
|
118
|
+
type: :runtime
|
119
|
+
prerelease: false
|
120
|
+
version_requirements: !ruby/object:Gem::Requirement
|
121
|
+
requirements:
|
122
|
+
- - ">="
|
123
|
+
- !ruby/object:Gem::Version
|
124
|
+
version: '0'
|
125
|
+
description: |2
|
126
|
+
Nothing exciting. Just gathers stats about the Puppet Community. Currently
|
127
|
+
draws data from the Puppet Forge, GitHub, and Slack. Optionally pushes data
|
128
|
+
into BigQuery for later consumption.
|
129
|
+
|
130
|
+
Run `mvp --help` to get started.
|
131
|
+
email: ben.ford@puppet.com
|
132
|
+
executables:
|
133
|
+
- mvp
|
134
|
+
extensions: []
|
135
|
+
extra_rdoc_files: []
|
136
|
+
files:
|
137
|
+
- LICENSE
|
138
|
+
- README.md
|
139
|
+
- bin/mvp
|
140
|
+
- lib/mvp.rb
|
141
|
+
- lib/mvp/downloader.rb
|
142
|
+
- lib/mvp/monkeypatches.rb
|
143
|
+
- lib/mvp/runner.rb
|
144
|
+
- lib/mvp/stats.rb
|
145
|
+
- lib/mvp/uploader.rb
|
146
|
+
homepage:
|
147
|
+
licenses:
|
148
|
+
- Apache 2
|
149
|
+
metadata: {}
|
150
|
+
post_install_message:
|
151
|
+
rdoc_options: []
|
152
|
+
require_paths:
|
153
|
+
- lib
|
154
|
+
required_ruby_version: !ruby/object:Gem::Requirement
|
155
|
+
requirements:
|
156
|
+
- - ">="
|
157
|
+
- !ruby/object:Gem::Version
|
158
|
+
version: '0'
|
159
|
+
required_rubygems_version: !ruby/object:Gem::Requirement
|
160
|
+
requirements:
|
161
|
+
- - ">="
|
162
|
+
- !ruby/object:Gem::Version
|
163
|
+
version: '0'
|
164
|
+
requirements: []
|
165
|
+
rubyforge_project:
|
166
|
+
rubygems_version: 2.5.2.3
|
167
|
+
signing_key:
|
168
|
+
specification_version: 4
|
169
|
+
summary: Generate some stats about the Puppet Community.
|
170
|
+
test_files: []
|