beauvoir 0.0.2c → 0.0.3

Files changed (7)

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA1:
-  metadata.gz: 9f58dceb149ae08f0e1ce66318423fd316a1b4f9
-  data.tar.gz: 5c1ceaeaee4f6f437b6afaa460c63e0415e1dfa1
+  metadata.gz: 37155ed7c902ddc28438c6b4ad54b6cb83bad8e0
+  data.tar.gz: de8d1e463cc33fb728718a3383b5d0cc4c2eaa1b
 SHA512:
-  metadata.gz: 34ca58e2b539d7ed9b5e1be9a6a3470681e537f57a69a554d82e0affad33030e9574c985a13ef55dc0a9ee19e301df25f192de56668030635002fddcb44c6a12
-  data.tar.gz: f4206f47ecc7dccb2dfdc05b910fbdf2d5b8cb41eeb12caee87bd13d27e845e901204c590c421e01ed2c8205bd81559be6c09878fbc8a441a52bf40cdb4e278e
+  metadata.gz: 1e06f1a247c296122b17670309b74b5d6230bc08a5a4a609ab57a583bdc64726bc2ef4cf7fe82d985dd7c910b895f967c5a71ccd488de8003488180df5f80299
+  data.tar.gz: 94c3c3c697c23f0b43fdb49d4688689a02913c8db81eb46e7482e38cf850c0c76fcc01e92670400647cd92ee3e89b1c111fed1f78470e1d7e13adb09be076f51
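
The digests in checksums.yaml can be recomputed locally. The following is a minimal sketch, assuming `metadata.gz` and `data.tar.gz` have already been extracted from the `.gem` archive into the current directory; `archive_digests` is a hypothetical helper name and not part of RubyGems.

```ruby
require "digest"

# Hypothetical helper: returns the SHA1 and SHA512 hex digests of a file,
# in the same hex form that checksums.yaml stores them.
def archive_digests(path)
  {
    "SHA1"   => Digest::SHA1.file(path).hexdigest,
    "SHA512" => Digest::SHA512.file(path).hexdigest
  }
end

# Print digests for whichever of the two archives are present.
%w[metadata.gz data.tar.gz].each do |archive|
  next unless File.exist?(archive) # skip quietly when the archive is absent
  puts "#{archive}: #{archive_digests(archive)}"
end
```

Comparing the printed values with the `+` lines above would confirm that a downloaded copy matches the 0.0.3 release.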

data/README.md CHANGED

@@ -12,9 +12,9 @@ This is pre-alpha software. The API will change, I guarantee it.
 Caveats
 -------
-It's important to note that many people identify as neither a men nor a women. It's important, too, to note that many people who do identify as male or female have names for which most other people with that name identify as a different gender. All of these people deserve not to be misgendered.
+It's important to note that many people identify as neither a man nor a woman. It's important, too, to note that many people who do identify as male or female have names for which most other people with that name identify as a different gender. All of these people deserve not to be misgendered.
-Nevertheless, automatically classifying people by apparent gender can be a very useful tool to perform censuses of communities or publications to detect and quantify perhaps-invisible bias. VIDA is a pioneer in performing theses censuses, but their \"Count\" is limited by a manual methodology that depends hundreds of person-hours of labor. There is a place for more automated counts and Beauvoir can help, but if you plan to publish a count like this, you should be careful. Beauvoir's confidence thresholds are set very high by default on purpose, you shouldn't lower them unless you take other steps to make sure that you're very unlikely to misgender someone; you should also be prepared to be responsive and respectful if you do. You should include your methodology, prominently. You might consider emphasizing aggregate numbers over your mapping of individual people's names to genders.
+Nevertheless, automatically classifying people by apparent gender can be a very useful tool to perform censuses of communities or publications to detect and quantify perhaps-invisible bias. VIDA is a pioneer in performing these censuses, but their "Count" is limited by a manual methodology that depends on hundreds of person-hours of labor. There is a place for more automated counts and Beauvoir can help, but if you plan to publish a count like this, you should be careful. Beauvoir's confidence thresholds are set very high by default on purpose. You shouldn't lower them unless you take other steps to make sure that you're very unlikely to misgender someone; you should also be prepared to be responsive and respectful if you do. You should include your methodology, prominently. You might also consider emphasizing aggregate numbers over your mapping of individual people's names to genders.
 Usage
 -----

data/lib/beauvoir.rb CHANGED

@@ -1,123 +1,122 @@
 require 'csv'
 require 'set'
-require_relative './name'
+require_relative './beauvoir/statistics'
+require_relative './beauvoir/name'
-class Beauvoir
-  DEFAULT_PROPORTION_THRESHOLD = 0.99
-  DEFAULT_LOWER_CONFIDENCE_BOUND = 0.75
+module Beauvoir
+  class Categorizer
+    DEFAULT_PROPORTION_THRESHOLD = 0.99
+    DEFAULT_LOWER_CONFIDENCE_BOUND = 0.75
-  # these aren't writable because once a Beauvoir is initialized, since their
-  # value is baked into Beauvoir's internal judgments of gender.
-  attr_reader :threshold, :lower_confidence_bound, :names_by_names, :names_genders
+    # These are read-only because, once a Categorizer is initialized, their
+    # values are baked into its internal judgments of gender.
+    attr_reader :threshold, :lower_confidence_bound, :names_by_names, :names_genders
-  def initialize(options={})
-    countries = Set.new([:us, :uk])
+    def initialize(options={})
+      countries = Set.new([:us, :uk])
-    @threshold = options[:threshold] || DEFAULT_PROPORTION_THRESHOLD
+      @threshold = options[:threshold] || DEFAULT_PROPORTION_THRESHOLD
-    # TODO: what should this be in the default case? (0, i.e. ignore the lower bound?, some sensical value to
-    # exclude a naive user from getting back nonsense? the bare minimum value for a loose significance level?)
-    @lower_confidence_bound = options[:lower_confidence_bound] || DEFAULT_LOWER_CONFIDENCE_BOUND
+      # TODO: what should this be in the default case? (0, i.e. ignore the lower bound?, some sensical value to
+      # exclude a naive user from getting back nonsense? the bare minimum value for a loose significance level?)
+      @lower_confidence_bound = options[:lower_confidence_bound] || DEFAULT_LOWER_CONFIDENCE_BOUND
-    @names_by_names = {}
-    # @country_totals = {}
-    @names_genders = {}
+      @names_by_names = {}
+      # @country_totals = {}
+      @names_genders = {}
-    if options[:country] && !options[:countries]
-      countries &= Set.new([options[:country].to_sym])
-    elsif options[:countries] && !options[:country]
-      countries &= Set.new(options[:countries].map(&:to_sym))
-    elsif options[:countries] && options[:country]
-      raise ArgumentError, "Specify either :country or :countries, not both."
-    end
+      if options[:country] && !options[:countries]
+        countries &= Set.new([options[:country].to_sym])
+      elsif options[:countries] && !options[:country]
+        countries &= Set.new(options[:countries].map(&:to_sym))
+      elsif options[:countries] && options[:country]
+        raise ArgumentError, "Specify either :country or :countries, not both."
+      end
-    #TODO: consider "piecewise" loading with stashing of already-loaded names
-    # to avoid ~10sec delay when loading into memory
-    #(e.g. seeking around the file?)
-    countries.each do |country|
-      CSV.open(File.join(File.dirname(File.expand_path(__FILE__)), "data/#{country}processed.csv"), :headers => true).each do |row|
-        name_str = Beauvoir.normalize(row["Name"])
-        name = @names_by_names.fetch(name_str, Name.new(name_str))
-        name.male_count += row["count.male"].to_i
-        name.female_count += row["count.female"].to_i
-        @names_by_names[name_str] = name
+      #TODO: consider "piecewise" loading with stashing of already-loaded names
+      # to avoid ~10sec delay when loading into memory
+      #(e.g. seeking around the file?)
+      countries.each do |country|
+        CSV.open(File.join(File.dirname(File.expand_path(__FILE__)), "data/#{country}processed.csv"), :headers => true).each do |row|
+          name_str = Beauvoir::Categorizer.normalize(row["Name"])
+          name = @names_by_names.fetch(name_str, Name.new(name_str))
+          name.male_count += row["count.male"].to_i
+          name.female_count += row["count.female"].to_i
+          @names_by_names[name_str] = name
+        end
       end
-    end
-    @names_by_names.values.each do |name|
-      @names_genders[name.name] = if sufficiently_confident(name)
-                                    name.gender
-                                  else
-                                    :unknown
-                                  end
+      @names_by_names.values.each do |name|
+        @names_genders[name.name] = name.guess_gender(@threshold, @lower_confidence_bound)
+      end
+      self
     end
-    self
-  end
-  def sufficiently_confident(name)
-    (name.male_proportion > @threshold || name.female_proportion > @threshold) &&
-      name.lower > @lower_confidence_bound
-  end
-  #
-  # Transform any name-like string into an unpadded, initial-cased first name.
-  # Should be a surjection, mapping many possible inputs (e.g. "Jeremy", "Jeremy.", "JEREMY", "Jeremy B. Merrill")
-  # onto one single name.
-  # This is used for two things:
-  # 1. Accepting differently-formatted/tokenized names from the user.
-  # 2. Dealing with differently-formatted names from the source agencies (e.g. "Mckinley" v. "McKinley", "Obrien", vs. "O'brien")
-  #
-  def self.normalize(name)
-    name.tr!("^A-Za-z' \-", '')
-    # name.gsub!(/[^A-Za-z \-\']+/, '') #this I suspect is done more efficiently with String#tr
-    if name.include?(" ")
-      name = name[0...name.index(" ")]
+    # def sufficiently_confident(name)
+    #   (name.male_proportion > @threshold || name.female_proportion > @threshold) &&
+    #     name.lower > @lower_confidence_bound
+    # end
+    #
+    # Transform any name-like string into an unpadded, initial-cased first name.
+    # Should be a surjection, mapping many possible inputs (e.g. "Jeremy", "Jeremy.", "JEREMY", "Jeremy B. Merrill")
+    # onto one single name.
+    # This is used for two things:
+    # 1. Accepting differently-formatted/tokenized names from the user.
+    # 2. Dealing with differently-formatted names from the source agencies (e.g. "Mckinley" v. "McKinley", "Obrien", vs. "O'brien")
+    #
+    def self.normalize(name)
+      name.tr!("^A-Za-z' \-", '')
+      # name.gsub!(/[^A-Za-z \-\']+/, '') #this I suspect is done more efficiently with String#tr
+      if name.include?(" ")
+        name = name[0...name.index(" ")]
+      end
+      name[0].upcase + name[1..-1].downcase
     end
-    name[0].upcase + name[1..-1].downcase
-  end
-  def guess(name)
-    @names_genders.fetch(Beauvoir.normalize(name), :unknown)
-  end
+    def guess(name)
+      @names_genders.fetch(Beauvoir::Categorizer.normalize(name), :unknown)
+    end
-  def estimated_male_value(name)
-    if name_obj = @names_by_names[Beauvoir.normalize(name)]
-      name_obj.estimated_male_value
-    else
-      nil
+    def estimated_male_value(name)
+      if name_obj = @names_by_names[Beauvoir::Categorizer.normalize(name)]
+        name_obj.estimated_male_value
+      else
+        nil
+      end
     end
-  end
-  def estimated_female_value(name)
-    if name_obj = @names_by_names[Beauvoir.normalize(name)]
-      name_obj.estimated_female_value
-    else
-      nil
+    def estimated_female_value(name)
+      if name_obj = @names_by_names[Beauvoir::Categorizer.normalize(name)]
+        name_obj.estimated_female_value
+      else
+        nil
+      end
     end
-  end
-  def male_proportion(name)
-    if name_obj = @names_by_names[Beauvoir.normalize(name)]
-      name_obj.male_proportion
-    else
-      nil
+    def raw_male_proportion(name)
+      if name_obj = @names_by_names[Beauvoir::Categorizer.normalize(name)]
+        name_obj.raw_male_proportion
+      else
+        nil
+      end
     end
-  end
-  def female_proportion(name)
-    if name_obj = @names_by_names[Beauvoir.normalize(name)]
-      name_obj.female_proportion
-    else
-      nil
+    def raw_female_proportion(name)
+      if name_obj = @names_by_names[Beauvoir::Categorizer.normalize(name)]
+        name_obj.raw_female_proportion
+      else
+        nil
+      end
     end
-  end
-  def inspect
-    inspect_string = "#<#{self.class.name}:0x#{(self.object_id*2).to_s(16)} "
-    exclude = [:@names_by_names, :@names_genders]
-    fields = self.instance_variables - exclude
-    inspect_string << fields.map{|field| "#{field}=#{instance_variable_get(field)}"}.join(", ") << ">"
-    inspect_string
+    def inspect
+      inspect_string = "#<#{self.class.name}:0x#{(self.object_id*2).to_s(16)} "
+      exclude = [:@names_by_names, :@names_genders]
+      fields = self.instance_variables - exclude
+      inspect_string << fields.map{|field| "#{field}=#{instance_variable_get(field)}"}.join(", ") << ">"
+      inspect_string
+    end
   end
 end
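
`Categorizer.normalize` as shown calls `tr!`, which modifies the caller's string in place and raises on a frozen string. It also fails when the cleaned input has no leading letters, since `name[0]` is then `nil`. A minimal non-mutating sketch of the same mapping follows. `normalize_first_name` is a hypothetical name, and returning `nil` for input without any usable characters is an assumption, not the gem's behavior.

```ruby
# Hypothetical, non-mutating variant of the normalization shown in the diff.
# Keeps only letters, apostrophes, spaces and hyphens, takes the first
# whitespace-separated token, and initial-cases it.
def normalize_first_name(raw)
  cleaned = raw.delete("^A-Za-z' \\-")  # String#delete returns a new string
  first = cleaned.split(" ").first      # nil when nothing usable remains
  return nil if first.nil? || first.empty?
  first[0].upcase + first[1..-1].downcase
end

puts normalize_first_name("Jeremy B. Merrill")
puts normalize_first_name("JEREMY".freeze)
puts normalize_first_name("McKinley")
```

Because the input is never mutated, the same string can safely be passed to `guess` and then reused by the caller.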

data/lib/beauvoir/name.rb ADDED

@@ -0,0 +1,70 @@
+module Beauvoir
+  class Name
+    include Beauvoir::Statistics
+    attr_accessor :male_count, :female_count, :name
+    def initialize(name, options={})
+      # default_options = {
+      #   :significance_level => 0.95,
+      # }
+      @options = options #default_options.merge(options)
+      @male_count = 0
+      @female_count = 0
+      @name = name
+      # @significance_level = @options[:significance_level]
+    end
+    def guess_gender(threshold=DEFAULT_PROPORTION_THRESHOLD, lower_confidence_bound=DEFAULT_LOWER_CONFIDENCE_BOUND)
+      if sufficiently_confident(threshold, lower_confidence_bound)
+        gender
+      else
+        :unknown
+      end
+    end
+    def raw_female_proportion
+      return 0 unless self.total > 0
+      @female_count / self.total
+    end
+    def raw_male_proportion
+      return 0 unless self.total > 0
+      @male_count / self.total
+    end
+    def total
+      (@male_count + @female_count).to_f
+    end
+    private
+      # These methods are private for a reason.
+      # You should use the guess_gender method instead.
+      # (See README.md for more discussion.)
+      def female?
+        #pure proportions, so even the slightest greater proportion of one gender will affect this
+        @female_count > @male_count
+      end
+      def male?
+        #pure proportions, so even the slightest greater proportion of one gender will affect this
+        @male_count > @female_count
+      end
+      def gender
+        if female?
+          :female
+        elsif male?
+          :male
+        else
+          :unknown
+        end
+      end
+      def sufficiently_confident(threshold=DEFAULT_PROPORTION_THRESHOLD, lower_confidence_bound=DEFAULT_LOWER_CONFIDENCE_BOUND)
+        (raw_male_proportion > threshold || raw_female_proportion > threshold) &&
+          lower > lower_confidence_bound
+      end
+  end
+end
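
One thing to watch in `guess_gender` and `sufficiently_confident`: their default arguments name `DEFAULT_PROPORTION_THRESHOLD` and `DEFAULT_LOWER_CONFIDENCE_BOUND` without qualification, while this diff defines those constants only inside `Beauvoir::Categorizer`. `Categorizer#initialize` always passes both values explicitly, so the defaults are never evaluated on that path. The sketch below uses a hypothetical `Sketch` module, not the gem itself, to show how Ruby resolves such a default when a caller omits the argument.

```ruby
# Hypothetical stand-in for the layout in the diff: the constant lives in
# one class, and a sibling class refers to it from a default argument.
module Sketch
  class Categorizer
    DEFAULT_PROPORTION_THRESHOLD = 0.99
  end

  class Name
    # Unqualified: looked up in Name, then Sketch, then Name's ancestors.
    # Categorizer is none of those, so the lookup fails when the default runs.
    def unqualified(threshold = DEFAULT_PROPORTION_THRESHOLD)
      threshold
    end

    # Qualified: resolves through Sketch::Categorizer at call time.
    def qualified(threshold = Categorizer::DEFAULT_PROPORTION_THRESHOLD)
      threshold
    end
  end
end

name = Sketch::Name.new
puts name.qualified          # default is found
puts name.unqualified(0.5)   # explicit argument, default never evaluated
begin
  name.unqualified           # default is evaluated here
rescue NameError => e
  puts "NameError: #{e.message}"
end
```

Default arguments are evaluated lazily, at call time, which is why the unqualified form only fails for callers that rely on the default.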

data/lib/beauvoir/statistics.rb ADDED

@@ -0,0 +1,48 @@
+require_relative './statistics'
+module Beauvoir
+  module Statistics
+      # fancy statistics!
+    #------------------
+    # implements Agresti-Coull estimated value and binomial confidence interval
+    # via:
+    # - http://codesequoia.wordpress.com/2010/12/06/unit-test-and-statistics/
+    # - http://stackoverflow.com/questions/3749125/how-should-i-order-these-helpful-scores/3752941#3752941
+    # -
+    # MAGIC_STATISTICS_NUMBER is the z-score for the chosen confidence level
+    # (it is derived from alpha, the significance level).
+    # 1.96 corresponds to a 95% confidence level, i.e. alpha = 0.05.
+    #
+    MAGIC_STATISTICS_NUMBER = 1.96
+    def z
+      # TODO: https://github.com/clbustos/statsample/blob/1168d58b14a5095af0a639b4843b31433d40f105/lib/statsample/srs.rb
+      #@significance_level #do stuff with this.
+      MAGIC_STATISTICS_NUMBER
+    end
+    def estimated_female_value
+      estimated_value_formula(@female_count)
+    end
+    def estimated_male_value
+      estimated_value_formula(@male_count)
+    end
+    def estimated_value
+      estimated_value_formula([@male_count, @female_count].max)
+    end
+    #  returns lower bound of higher of male/female
+    def lower
+      nt = total + z ** 2
+      interval = z * Math.sqrt(estimated_value * (1 - estimated_value) / nt)
+      [raw_female_proportion, raw_male_proportion].max - interval
+    end
+    def estimated_value_formula(observed)
+      nt = total + z ** 2
+      (observed + ((z ** 2) / 2)) / nt
+    end
+  end
+end
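
To see why the lower bound matters alongside the proportion threshold, here is a standalone sketch of the arithmetic in `estimated_value_formula` and `lower`. The method names `adjusted_estimate` and `lower_bound` are hypothetical. The sketch mirrors the diff in subtracting the interval from the raw majority proportion; the textbook Agresti-Coull interval is centered on the adjusted estimate instead.

```ruby
Z = 1.96 # z-score for a 95% confidence level, as in the diff

# Agresti-Coull adjusted estimate: (x + z^2/2) / (n + z^2)
def adjusted_estimate(observed, total, z = Z)
  (observed + (z**2) / 2.0) / (total + z**2)
end

# Lower bound for the majority gender, following the diff's `lower`:
# raw majority proportion minus z * sqrt(p~ * (1 - p~) / (n + z^2)).
def lower_bound(male_count, female_count, z = Z)
  total = (male_count + female_count).to_f
  return 0.0 if total.zero? # mirrors the zero-total guard on proportions
  majority = [male_count, female_count].max
  p_tilde = adjusted_estimate(majority, total, z)
  interval = z * Math.sqrt(p_tilde * (1 - p_tilde) / (total + z**2))
  majority / total - interval
end

puts lower_bound(0, 3)     # tiny sample, unanimous
puts lower_bound(0, 1000)  # large sample, unanimous
```

A name seen three times, always with the same gender, has a raw proportion of 1.0 and so clears the 0.99 threshold, yet its lower bound is well under the default 0.75 and the name is reported as `:unknown`. With a thousand unanimous observations the bound comes out above 0.99.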

metadata CHANGED

@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: beauvoir
 version: !ruby/object:Gem::Version
-  version: 0.0.2c
+  version: 0.0.3
 platform: ruby
 authors:
 - Jeremy B. Merrill
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2013-11-14 00:00:00.000000000 Z
+date: 2014-01-02 00:00:00.000000000 Z
 dependencies: []
 description: "Guess gender by a first name using more detailed, better\n                    sourced
   data from Open Gender Tracker's Global Name Data.<br />\n                    Beauvoir
@@ -33,12 +33,13 @@ executables: []
 extensions: []
 extra_rdoc_files: []
 files:
-- lib/beauvoir.rb
-- lib/name.rb
 - LICENSE
 - README.md
-- lib/data/usprocessed.csv
+- lib/beauvoir.rb
+- lib/beauvoir/name.rb
+- lib/beauvoir/statistics.rb
 - lib/data/ukprocessed.csv
+- lib/data/usprocessed.csv
 homepage: http://rubygems.org/gems/beauvoir
 licenses:
 - MIT
@@ -54,13 +55,14 @@ required_ruby_version: !ruby/object:Gem::Requirement
       version: '0'
 required_rubygems_version: !ruby/object:Gem::Requirement
   requirements:
-  - - '>'
+  - - '>='
     - !ruby/object:Gem::Version
-      version: 1.3.1
+      version: '0'
 requirements: []
 rubyforge_project:
-rubygems_version: 2.0.3
+rubygems_version: 2.2.0
 signing_key:
 specification_version: 4
 summary: Guess a person's gender by their first name
 test_files: []
+has_rdoc:
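
The change to `required_rubygems_version` is consistent with how RubyGems classifies the two version strings: a version containing a letter, such as `0.0.2c`, counts as a prerelease, while `0.0.3` does not. As far as I know, RubyGems sets `> 1.3.1` automatically on prerelease specs, although the diff itself does not say so. A short sketch using only `Gem::Version` and `Gem::Requirement`:

```ruby
require "rubygems"

old_version = Gem::Version.new("0.0.2c")
new_version = Gem::Version.new("0.0.3")

puts old_version.prerelease?    # letter segment marks a prerelease
puts new_version.prerelease?
puts old_version < new_version  # prerelease sorts before the release

# The two requirements from the metadata, checked against RubyGems 1.3.1.
old_requirement = Gem::Requirement.new("> 1.3.1")
new_requirement = Gem::Requirement.new(">= 0")
probe = Gem::Version.new("1.3.1")
puts old_requirement.satisfied_by?(probe)
puts new_requirement.satisfied_by?(probe)
```

One practical consequence: a plain `gem install beauvoir` skips prerelease versions unless `--pre` is given, so 0.0.3 is the first of these two releases that a default install would pick up.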

data/lib/name.rb DELETED

@@ -1,95 +0,0 @@
-class Name
-  attr_accessor :male_count, :female_count, :name
-  def initialize(name, options={})
-    # default_options = {
-    #   :significance_level => 0.95,
-    # }
-    @options = options #default_options.merge(options)
-    @male_count = 0
-    @female_count = 0
-    @name = name
-    # @significance_level = @options[:significance_level]
-  end
-  def male?
-    #pure proportions, so even the slightest greater proportion of one gender will affect this
-    @male_count > @female_count
-  end
-  def female?
-    @female_count > @male_count
-  end
-  def gender
-    if female?
-      :female
-    elsif male?
-      :male
-    else
-      :unknown
-    end
-  end
-  def female_proportion
-    return 0 unless self.total > 0
-    @female_count / self.total
-  end
-  def male_proportion
-    return 0 unless self.total > 0
-    @male_count / self.total
-  end
-  def total
-    (@male_count + @female_count).to_f
-  end
-  # fancy statistics!
-  #------------------
-  # implements Agresti-Coull estimated value and binomial confidence interval
-  # via:
-  # - http://codesequoia.wordpress.com/2010/12/06/unit-test-and-statistics/
-  # - http://stackoverflow.com/questions/3749125/how-should-i-order-these-helpful-scores/3752941#3752941
-  # -
-  #this is apparently related to alpha and related to the level of statistical significance we care about.
-  # 1.96 pertains to a 0.95 significance level.
-  #
-  MAGIC_STATISTICS_NUMBER = 1.96
-  def z
-    # TODO: https://github.com/clbustos/statsample/blob/1168d58b14a5095af0a639b4843b31433d40f105/lib/statsample/srs.rb
-    #@significance_level #do stuff with this.
-    MAGIC_STATISTICS_NUMBER
-  end
-  def estimated_female_value
-    estimated_value_formula(@female_count)
-  end
-  def estimated_male_value
-    estimated_value_formula(@male_count)
-  end
-  def estimated_value
-    estimated_value_formula([@male_count, @female_count].max)
-  end
-  #  returns lower bound of higher of male/female
-  def lower
-    nt = total + z ** 2
-    interval = z * Math.sqrt(estimated_value * (1 - estimated_value) / nt)
-    [female_proportion, male_proportion].max - interval
-  end
-  private
-  def estimated_value_formula(observed)
-    nt = total + z ** 2
-    (observed + ((z ** 2) / 2)) / nt
-  end
-end