beauvoir 0.0.1a → 0.0.2b

Files changed (5)
  1. checksums.yaml +4 -4
  2. data/README.md +13 -1
  3. data/lib/beauvoir.rb +43 -21
  4. data/lib/name.rb +58 -9
  5. metadata +22 -28
checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA1:
- metadata.gz: 9c046bac3fbc6a13b9f585aecaa4c03fa257e66f
- data.tar.gz: a47a01be8e526d0deefab9c1bf7d89c8ac99f5d4
+ metadata.gz: d08d58653a1e5586213426ba0026787bb0c9d5c6
+ data.tar.gz: 65f2d360cec83211db9809f6025195f243737720
  SHA512:
- metadata.gz: c61d84c385ca7c556d052c6be4788c55e39a852a9ec324d5e00c7becb1c4e66ae8ba71d1f6b5043a799d6b6061a3500a4fa6222811916cb5fd28546592b0045c
- data.tar.gz: 57da2d4ada7c580d07715713d7296bbb7c68ea049affe8339a7c535c852264040a81f4229b870e12271d5cc39103a0bde15f37ee4927a848fc28d8aa475764b1
+ metadata.gz: 495d9a48b07c37cb9110f4787cc75b8c4b9bc79089c34e493d541f09514223708ffabb995d199e9fd786b7a03307f0f3e73616e5a0e691609cab1c42563c8b7e
+ data.tar.gz: 189b378c9f7eacee1acaf206a60d975fd9f1c4108eb2b8cbad53a4535c7154f74a6398fea2ae2f66a0f2f1a90466170b6bd147d3e7e654d220fc10208b24be7b
data/README.md CHANGED
@@ -5,6 +5,10 @@ Beauvoir is a gem for guessing a person's gender by their first name. Caveats ap
 
  Beauvoir uses more precise data sourced from [Open Gender Tracker](http://opengendertracking.org/)'s [Global Name Data](https://github.com/OpenGenderTracking/globalnamedata). Beauvoir lets you set avg and lower bounds and choose countries from which to draw data (so far US, UK only, more to come soon).
 
+ Caution
+ -------
+ This is pre-alpha software. The API will change, I guarantee it.
+
  Caveats
  -------
 
@@ -56,6 +60,14 @@ Methodology
  -----------
  to come soon. email me if you need it asap.
 
+ A minimum estimated value: a best guess of the ratio of genders of people with a given name.
+ A minimum lower confidence bound: only 2.5 times out of a hundred (by default) with the _actual_ proportion of genders of people with this name fall below this bound. (It will be above outside the confidence interval 5% of the time, half of which is above and half of which is below. Ninety-five percent of the time, it will fall within the confidence interval. )
+ Set a level of statistical significance -- by default 0.95. The lower this level, the more likely the "true" parameter will fall outside the interval.
+
  Name
  ----
- Beauvoir is named after [Simone de Beauvoir](http://en.wikipedia.org/wiki/Simone_de_Beauvoir), a feminist author best known for _The Second Sex), which Wikipeida describes as "a detailed analysis of women's oppression and a foundational tract of contemporary feminism." Hopefully Beauvoir (the Gem) will have some effect by assisting computational efforts to shed light on gender bias and discrimination.
+ Beauvoir is named after [Simone de Beauvoir](http://en.wikipedia.org/wiki/Simone_de_Beauvoir), a feminist author best known for _The Second Sex), which Wikipedia describes as "a detailed analysis of women's oppression and a foundational tract of contemporary feminism." Hopefully Beauvoir (the Gem) will have some effect by assisting computational efforts to shed light on gender bias and discrimination.
+
+ TODO
+ ----
+ - Test A.C. Confidence Interval, Expected Value results against the ones in the source data. (The R is abstruse like a goose, so I'm not sure it's implemented right.)
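
Editor's sketch: the README text added in this diff describes a configurable level of statistical significance (0.95 by default), and 1.96 is the familiar z value for that level. The snippet below is one possible way to derive z from a level using only Ruby's standard `Math.erf`. The helper name `z_for_confidence` and its bisection bounds are assumptions made for illustration; nothing like it exists in the gem.

```ruby
# Hypothetical helper (not part of the gem): find the two-sided z critical
# value for a confidence level by bisecting on the standard normal CDF.
def z_for_confidence(level, tolerance = 1e-10)
  unless level > 0 && level < 1
    raise ArgumentError, "level must be strictly between 0 and 1"
  end

  low = 0.0
  high = 10.0 # assumed upper search bound; ample for any practical level
  while high - low > tolerance
    mid = (low + high) / 2.0
    # Math.erf(mid / sqrt(2)) is the probability that a standard normal
    # variable falls within +/- mid of zero.
    if Math.erf(mid / Math.sqrt(2)) < level
      low = mid
    else
      high = mid
    end
  end
  (low + high) / 2.0
end

puts z_for_confidence(0.95).round(2)
```

Bisection is used here only because it needs no extra dependency; a statistics library with an inverse normal CDF would serve the same purpose.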
data/lib/beauvoir.rb CHANGED
@@ -1,25 +1,24 @@
  require 'csv'
  require 'set'
- require './lib/name'
+ require_relative './name'
 
  class Beauvoir
- DEFAULT_AVERAGE_CONFIDENCE_THRESHOLD = 0.99
- DEFAULT_LOWER_CONFIDENCE_THRESHOLD = 0.5
+ DEFAULT_PROPORTION_THRESHOLD = 0.99
+ DEFAULT_LOWER_CONFIDENCE_BOUND = 0.75
+
 
  # these aren't writable because once a Beauvoir is initialized, since their
  # value is baked into Beauvoir's internal judgments of gender.
- attr_reader :threshold, :lower_threshold, :names_by_names, :names_gendersh
+ attr_reader :threshold, :lower_confidence_bound, :names_by_names, :names_genders
 
  def initialize(options={})
-
-
  countries = Set.new([:us, :uk])
 
- @threshold = options[:threshold] || DEFAULT_AVERAGE_CONFIDENCE_THRESHOLD
+ @threshold = options[:threshold] || DEFAULT_PROPORTION_THRESHOLD
 
  # TODO: what should this be in the default case? (0, i.e. ignore the lower bound?, some sensical value to
  # exclude a naive user from getting back nonsense? the bare minimum value for a loose significance level?)
- @lower_threshold = options[:lower_threshold] || DEFAULT_LOWER_CONFIDENCE_THRESHOLD
+ @lower_confidence_bound = options[:lower_confidence_bound] || DEFAULT_LOWER_CONFIDENCE_BOUND
 
  @names_by_names = {}
  # @country_totals = {}
@@ -37,11 +36,13 @@ class Beauvoir
  # to avoid ~10sec delay when loading into memory
  #(e.g. seeking around the file?)
  countries.each do |country|
- CSV.open("lib/data/#{country}processed.csv", :headers => true).each do |row|
- name = @names_by_names.fetch(row["Name"], Name.new(row["Name"]))
+
+ CSV.open(File.join(File.dirname(File.expand_path(__FILE__)), "data/#{country}processed.csv"), :headers => true).each do |row|
+ name_str = Beauvoir.normalize(row["Name"])
+ name = @names_by_names.fetch(name_str, Name.new(name_str))
  name.male_count += row["count.male"].to_i
  name.female_count += row["count.female"].to_i
- @names_by_names[row["Name"]] = name
+ @names_by_names[name_str] = name
  end
  end
 
@@ -52,20 +53,25 @@ class Beauvoir
  :unknown
  end
  end
- raise NoJohnError unless @names_by_names.include?("John")
  self
  end
 
  def sufficiently_confident(name)
  (name.male_proportion > @threshold || name.female_proportion > @threshold) &&
- name.lower > @lower_threshold
+ name.lower > @lower_confidence_bound
  end
 
  #
- # Transform any name-like string into an unpadded, initial-cased first name
+ # Transform any name-like string into an unpadded, initial-cased first name.
+ # Should be a surjection, mapping many possible inputs (e.g. "Jeremy", "Jeremy.", "JEREMY", "Jeremy B. Merrill")
+ # onto one single name.
+ # This is used for two things:
+ # 1. Accepting differently-formatted/tokenized names from the user.
+ # 2. Dealing with differently-formatted names from the source agencies (e.g. "Mckinley" v. "McKinley", "Obrien", vs. "O'brien")
  #
- def normalize(name)
- name.gsub!(/[^A-Za-z \-\']+/, '') #TODO: can this be done more efficiently with String#tr?
+ def self.normalize(name)
+ name.tr!("^A-Za-z' \-", '')
+ # name.gsub!(/[^A-Za-z \-\']+/, '') #this I suspect is done more efficiently with String#tr
  if name.include?(" ")
  name = name[0...name.index(" ")]
  end
@@ -73,18 +79,34 @@ class Beauvoir
  end
 
  def guess(name)
- @names_genders.fetch(normalize(name), :unknown)
+ @names_genders.fetch(Beauvoir.normalize(name), :unknown)
+ end
+
+ def estimated_male_value(name)
+ if name_obj = @names_by_names[Beauvoir.normalize(name)]
+ name_obj.estimated_male_value
+ else
+ nil
+ end
+ end
+
+ def estimated_female_value(name)
+ if name_obj = @names_by_names[Beauvoir.normalize(name)]
+ name_obj.estimated_female_value
+ else
+ nil
+ end
  end
 
- def maleness_ratio(name)
- if name_obj = @names_by_names[normalize(name)]
+ def male_proportion(name)
+ if name_obj = @names_by_names[Beauvoir.normalize(name)]
  name_obj.male_proportion
  else
  nil
  end
  end
- def femaleness_ratio(name)
- if name_obj = @names_by_names[normalize(name)]
+ def female_proportion(name)
+ if name_obj = @names_by_names[Beauvoir.normalize(name)]
  name_obj.female_proportion
  else
  nil
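
Editor's sketch: the `normalize` comment in this file describes a many-to-one mapping from name-like strings to a single initial-cased first name. The version in the diff calls `String#tr!`, which modifies the caller's string in place and raises on a frozen string. The snippet below is a minimal non-mutating variant for illustration. The name `normalize_name` is hypothetical, and the final `capitalize` step is an assumption based on the "initial-cased" wording, since the tail of the gem's method is not visible in this diff.

```ruby
# Hypothetical, non-mutating normalizer (illustrative only, not the gem's code).
def normalize_name(raw)
  # Keep only ASCII letters, apostrophes, spaces and hyphens.
  cleaned = raw.gsub(/[^A-Za-z' -]/, "")
  # Take the first whitespace-separated token; to_s covers empty input.
  first_token = cleaned.split(" ").first.to_s
  # Assumed reading of "initial-cased": first letter upper, the rest lower.
  first_token.capitalize
end

["Jeremy", "Jeremy.", "JEREMY", "  Jeremy B. Merrill"].each do |input|
  puts normalize_name(input)
end
```

Under these assumptions "McKinley" and "Mckinley" collapse to the same key. "Obrien" and "O'brien" still differ, because the apostrophe is kept.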
data/lib/name.rb CHANGED
@@ -3,13 +3,20 @@
  class Name
  attr_accessor :male_count, :female_count, :name
 
- def initialize(name)
+ def initialize(name, options={})
+ # default_options = {
+ # :significance_level => 0.95,
+ # }
+ @options = options #default_options.merge(options)
+
  @male_count = 0
  @female_count = 0
  @name = name
+ # @significance_level = @options[:significance_level]
  end
 
- def male?
+ def male?
+ #pure proportions, so even the slightest greater proportion of one gender will affect this
  @male_count > @female_count
  end
 
@@ -18,7 +25,13 @@ class Name
  end
 
  def gender
- female? ? :female : male? ? :male : :unknown #oh god this is unreadable
+ if female?
+ :female
+ elsif male?
+ :male
+ else
+ :unknown
+ end
  end
 
  def female_proportion
@@ -35,12 +48,48 @@ class Name
  (@male_count + @female_count).to_f
  end
 
- #implements Agresti-Coull binomial confidence interval, returns lower bound of higher of male/female
- # cf. http://codesequoia.wordpress.com/2010/12/06/unit-test-and-statistics/
+
+ # fancy statistics!
+ #------------------
+ # implements Agresti-Coull estimated value and binomial confidence interval
+ # via:
+ # - http://codesequoia.wordpress.com/2010/12/06/unit-test-and-statistics/
+ # - http://stackoverflow.com/questions/3749125/how-should-i-order-these-helpful-scores/3752941#3752941
+ # -
+ #this is apparently related to alpha and related to the level of statistical significance we care about.
+ # 1.96 pertains to a 0.95 significance level.
+ #
+ MAGIC_STATISTICS_NUMBER = 1.96
+
+ def z
+ # TODO: https://github.com/clbustos/statsample/blob/1168d58b14a5095af0a639b4843b31433d40f105/lib/statsample/srs.rb
+ #@significance_level #do stuff with this.
+ MAGIC_STATISTICS_NUMBER
+ end
+
+ def estimated_female_value
+ estimated_value_formula(@female_count)
+ end
+
+ def estimated_male_value
+ estimated_value_formula(@male_count)
+ end
+
+ def estimated_value
+ estimated_value_formula([@male_count, @female_count].max)
+ end
+
+ # returns lower bound of higher of male/female
  def lower
- nt = total + 1.96 ** 2
- pt = ([@male_count, @female_count].max + (1.96 ** 2) / 2) / nt
- e = 1.96 * Math.sqrt(pt * (1 - pt) / nt)
- [female_proportion, male_proportion].max - e
+ nt = total + z ** 2
+ interval = z * Math.sqrt(estimated_value * (1 - estimated_value) / nt)
+ [female_proportion, male_proportion].max - interval
+ end
+
+ private
+
+ def estimated_value_formula(observed)
+ nt = total + z ** 2
+ (observed + ((z ** 2) / 2)) / nt
  end
  end
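
Editor's sketch: the `lower` method in this diff subtracts the margin from the raw proportion (`[female_proportion, male_proportion].max`), whereas the textbook Agresti-Coull interval is centred on the adjusted estimate. The standalone snippet below shows the textbook form so the two can be compared, which may help with the README's TODO about checking results against the source data. The function name `agresti_coull_interval` and its return shape are assumptions for illustration, not the gem's API.

```ruby
# Textbook Agresti-Coull interval as a standalone sketch (not the gem's code).
# successes: observed count for one gender; trials: total observations.
def agresti_coull_interval(successes, trials, z = 1.96)
  raise ArgumentError, "trials must be positive" unless trials > 0

  adjusted_trials = trials + z**2
  adjusted_estimate = (successes + (z**2) / 2.0) / adjusted_trials
  margin = z * Math.sqrt(adjusted_estimate * (1 - adjusted_estimate) / adjusted_trials)

  {
    estimate: adjusted_estimate,
    lower: adjusted_estimate - margin, # centred on the adjusted estimate
    upper: adjusted_estimate + margin
  }
end

interval = agresti_coull_interval(90, 100)
puts interval[:estimate].round(3)
puts interval[:lower].round(3)
puts interval[:upper].round(3)
```

For a name seen 90 times as one gender out of 100, the adjusted estimate sits slightly below the raw 0.9, so a lower bound centred on it is also lower than one measured from the raw proportion.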
metadata CHANGED
@@ -1,38 +1,33 @@
  --- !ruby/object:Gem::Specification
  name: beauvoir
  version: !ruby/object:Gem::Version
- version: 0.0.1a
+ version: 0.0.2b
  platform: ruby
  authors:
  - Jeremy B. Merrill
  autorequire:
  bindir: bin
  cert_chain: []
- date: 2013-10-18 00:00:00.000000000 Z
+ date: 2013-10-26 00:00:00.000000000 Z
  dependencies: []
- description: |-
- Guess gender by a first name using more detailed, better
- sourced data from Open Gender Tracker's Global Name Data.
- Beauvoir lets you set avg and lower bounds and choose
- countries from which to draw data.
-
- It's important to note that many people identify as neither
- a men nor a women. It's important, too, to note that many
- people who do identify as male or female have names that
- are held by far more people who identify as another gender.
- All of these people deserve not to be misgendered in public
- (or in private).
-
- Nevertheless, automatically classifying people by apparent
- gender can be a very useful tool to perform censuses of
- communities or publications to detect and quantify
- perhaps-invisible bias. VIDA is a pioneer in this field,
- but their "Count" is limited by a manual methodology that
- depends hundreds of person-hours of labor. There is a place
- for more automated counts and Beauvoir can help, but if you
- do a count like this, you should be careful in how you word
- your findings not to misgender anyone in particular and be
- responsive to the possibility of errors.
+ description: "Guess gender by a first name using more detailed, better\n sourced
+ data from Open Gender Tracker's Global Name Data.<br />\n Beauvoir
+ lets you set avg and lower bounds and choose\n countries from
+ which to draw data. \n\n\n It's important to note that many people
+ identify as neither\n a men nor a women. It's important, too,
+ to note that many\n people who do identify as male or female
+ have names that<br />\n are held by far more people who identify
+ as another gender.\n All of these people deserve not to be misgendered
+ in public\n (or in private).\n\n Nevertheless,
+ automatically classifying people by apparent\n gender can be
+ a very useful tool to perform censuses of\n communities or publications
+ to detect and quantify\n perhaps-invisible bias. VIDA is a pioneer
+ in this field,\n but their \"Count\" is limited by a manual methodology
+ that\n depends hundreds of person-hours of labor. There is a
+ place\n for more automated counts and Beauvoir can help, but
+ if you\n do a count like this, you should be careful in how you
+ word\n your findings not to misgender anyone in particular and
+ be\n responsive to the possibility of errors."
  email: jeremybmerrill@jeremybmerrill.com
  executables: []
  extensions: []
@@ -42,8 +37,8 @@ files:
  - lib/name.rb
  - LICENSE
  - README.md
- - lib/data/ukprocessed.csv
  - lib/data/usprocessed.csv
+ - lib/data/ukprocessed.csv
  homepage: http://rubygems.org/gems/beauvoir
  licenses:
  - MIT
@@ -64,9 +59,8 @@ required_rubygems_version: !ruby/object:Gem::Requirement
  version: 1.3.1
  requirements: []
  rubyforge_project:
- rubygems_version: 2.0.2
+ rubygems_version: 2.0.3
  signing_key:
  specification_version: 4
  summary: Guess a person's gender by their first name
  test_files: []
- has_rdoc: