beauvoir 0.0.1a → 0.0.2b
- checksums.yaml +4 -4
- data/README.md +13 -1
- data/lib/beauvoir.rb +43 -21
- data/lib/name.rb +58 -9
- metadata +22 -28
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA1:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: d08d58653a1e5586213426ba0026787bb0c9d5c6
+  data.tar.gz: 65f2d360cec83211db9809f6025195f243737720
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 495d9a48b07c37cb9110f4787cc75b8c4b9bc79089c34e493d541f09514223708ffabb995d199e9fd786b7a03307f0f3e73616e5a0e691609cab1c42563c8b7e
+  data.tar.gz: 189b378c9f7eacee1acaf206a60d975fd9f1c4108eb2b8cbad53a4535c7154f74a6398fea2ae2f66a0f2f1a90466170b6bd147d3e7e654d220fc10208b24be7b
data/README.md
CHANGED
@@ -5,6 +5,10 @@ Beauvoir is a gem for guessing a person's gender by their first name. Caveats ap
 
 Beauvoir uses more precise data sourced from [Open Gender Tracker](http://opengendertracking.org/)'s [Global Name Data](https://github.com/OpenGenderTracking/globalnamedata). Beauvoir lets you set avg and lower bounds and choose countries from which to draw data (so far US, UK only, more to come soon).
 
+Caution
+-------
+This is pre-alpha software. The API will change, I guarantee it.
+
 Caveats
 -------
 
@@ -56,6 +60,14 @@ Methodology
 -----------
 to come soon. email me if you need it asap.
 
+A minimum estimated value: a best guess of the ratio of genders of people with a given name.
+A minimum lower confidence bound: only 2.5 times out of a hundred (by default) with the _actual_ proportion of genders of people with this name fall below this bound. (It will be above outside the confidence interval 5% of the time, half of which is above and half of which is below. Ninety-five percent of the time, it will fall within the confidence interval. )
+Set a level of statistical significance -- by default 0.95. The lower this level, the more likely the "true" parameter will fall outside the interval.
+
 Name
 ----
-Beauvoir is named after [Simone de Beauvoir](http://en.wikipedia.org/wiki/Simone_de_Beauvoir), a feminist author best known for _The Second Sex), which
+Beauvoir is named after [Simone de Beauvoir](http://en.wikipedia.org/wiki/Simone_de_Beauvoir), a feminist author best known for _The Second Sex), which Wikipedia describes as "a detailed analysis of women's oppression and a foundational tract of contemporary feminism." Hopefully Beauvoir (the Gem) will have some effect by assisting computational efforts to shed light on gender bias and discrimination.
+
+TODO
+----
+- Test A.C. Confidence Interval, Expected Value results against the ones in the source data. (The R is abstruse like a goose, so I'm not sure it's implemented right.)
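The README lines added above describe an estimated value and a lower confidence bound. As a rough illustration only, the textbook Agresti-Coull adjustment can be sketched as follows. The method name `agresti_coull` and its arguments are hypothetical and are not part of Beauvoir's API. The sketch centers the interval on the adjusted estimate, which may differ from how the gem itself computes its bound.

```ruby
# Sketch of the Agresti-Coull adjustment for a binomial proportion.
# `agresti_coull` and its argument names are assumptions made for
# illustration; they do not come from the gem.
Z_95 = 1.96 # z-score for a two-sided 95% confidence level

def agresti_coull(successes, total, z = Z_95)
  adjusted_total = total + z**2
  # Adjusted point estimate: add z^2 / 2 pseudo-successes and z^2 pseudo-trials.
  estimate = (successes + z**2 / 2.0) / adjusted_total
  # Half-width of the interval around the adjusted estimate.
  half_width = z * Math.sqrt(estimate * (1 - estimate) / adjusted_total)
  { estimate: estimate, lower: estimate - half_width, upper: estimate + half_width }
end

# Example: 95 of 100 recorded people with a name are in one category.
result = agresti_coull(95, 100)
puts format("estimate=%.4f lower=%.4f upper=%.4f",
            result[:estimate], result[:lower], result[:upper])
```

With few observations the adjusted estimate is pulled toward 0.5 and the interval widens, which is why a lower bound is a more cautious filter than the raw proportion alone.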
data/lib/beauvoir.rb
CHANGED
@@ -1,25 +1,24 @@
 require 'csv'
 require 'set'
-
+require_relative './name'
 
 class Beauvoir
-
-
+  DEFAULT_PROPORTION_THRESHOLD = 0.99
+  DEFAULT_LOWER_CONFIDENCE_BOUND = 0.75
+
 
   # these aren't writable because once a Beauvoir is initialized, since their
   # value is baked into Beauvoir's internal judgments of gender.
-  attr_reader :threshold, :
+  attr_reader :threshold, :lower_confidence_bound, :names_by_names, :names_genders
 
   def initialize(options={})
-
-
     countries = Set.new([:us, :uk])
 
-    @threshold = options[:threshold] ||
+    @threshold = options[:threshold] || DEFAULT_PROPORTION_THRESHOLD
 
     # TODO: what should this be in the default case? (0, i.e. ignore the lower bound?, some sensical value to
     # exclude a naive user from getting back nonsense? the bare minimum value for a loose significance level?)
-    @
+    @lower_confidence_bound = options[:lower_confidence_bound] || DEFAULT_LOWER_CONFIDENCE_BOUND
 
     @names_by_names = {}
     # @country_totals = {}
@@ -37,11 +36,13 @@ class Beauvoir
     # to avoid ~10sec delay when loading into memory
     #(e.g. seeking around the file?)
     countries.each do |country|
-
-
+
+      CSV.open(File.join(File.dirname(File.expand_path(__FILE__)), "data/#{country}processed.csv"), :headers => true).each do |row|
+        name_str = Beauvoir.normalize(row["Name"])
+        name = @names_by_names.fetch(name_str, Name.new(name_str))
         name.male_count += row["count.male"].to_i
         name.female_count += row["count.female"].to_i
-        @names_by_names[
+        @names_by_names[name_str] = name
       end
     end
 
@@ -52,20 +53,25 @@ class Beauvoir
         :unknown
       end
     end
-    raise NoJohnError unless @names_by_names.include?("John")
     self
   end
 
   def sufficiently_confident(name)
     (name.male_proportion > @threshold || name.female_proportion > @threshold) &&
-      name.lower > @
+      name.lower > @lower_confidence_bound
   end
 
   #
-  # Transform any name-like string into an unpadded, initial-cased first name
+  # Transform any name-like string into an unpadded, initial-cased first name.
+  # Should be a surjection, mapping many possible inputs (e.g. "Jeremy", "Jeremy.", "JEREMY", "Jeremy B. Merrill")
+  # onto one single name.
+  # This is used for two things:
+  # 1. Accepting differently-formatted/tokenized names from the user.
+  # 2. Dealing with differently-formatted names from the source agencies (e.g. "Mckinley" v. "McKinley", "Obrien", vs. "O'brien")
   #
-  def normalize(name)
-    name.
+  def self.normalize(name)
+    name.tr!("^A-Za-z' \-", '')
+    # name.gsub!(/[^A-Za-z \-\']+/, '') #this I suspect is done more efficiently with String#tr
     if name.include?(" ")
       name = name[0...name.index(" ")]
     end
@@ -73,18 +79,34 @@ class Beauvoir
   end
 
   def guess(name)
-    @names_genders.fetch(normalize(name), :unknown)
+    @names_genders.fetch(Beauvoir.normalize(name), :unknown)
+  end
+
+  def estimated_male_value(name)
+    if name_obj = @names_by_names[Beauvoir.normalize(name)]
+      name_obj.estimated_male_value
+    else
+      nil
+    end
+  end
+
+  def estimated_female_value(name)
+    if name_obj = @names_by_names[Beauvoir.normalize(name)]
+      name_obj.estimated_female_value
+    else
+      nil
+    end
   end
 
-  def
-    if name_obj = @names_by_names[normalize(name)]
+  def male_proportion(name)
+    if name_obj = @names_by_names[Beauvoir.normalize(name)]
       name_obj.male_proportion
     else
       nil
     end
   end
-  def
-    if name_obj = @names_by_names[normalize(name)]
+  def female_proportion(name)
+    if name_obj = @names_by_names[Beauvoir.normalize(name)]
       name_obj.female_proportion
     else
       nil
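The comment on `normalize` in this file describes mapping many inputs ("Jeremy", "Jeremy.", "JEREMY", "Jeremy B. Merrill") onto a single first name. The diff elides the end of that method, so the following is only a sketch of one way such a mapping could look. The name `normalize_name` is hypothetical, and unlike the `tr!` call in the diff it does not mutate the caller's string.

```ruby
# Hypothetical, non-mutating take on the normalization described above.
# `normalize_name` is an illustrative name, not the gem's method.
def normalize_name(raw)
  # Drop everything except letters, apostrophes, spaces and hyphens.
  # String#delete returns a new string, so frozen input is safe.
  cleaned = raw.delete("^A-Za-z' \\-").strip
  # Keep only the first whitespace-separated token (the first name).
  first = cleaned.split(" ").first.to_s
  # Initial-case it: first letter upper, the rest lower.
  first.capitalize
end

["Jeremy", "Jeremy.", "JEREMY", "Jeremy B. Merrill"].each do |input|
  puts "#{input.inspect} -> #{normalize_name(input).inspect}"
end
```

Because `capitalize` lowercases everything after the first letter, "McKinley" and "Mckinley" collapse to the same key. Apostrophe variants such as "Obrien" and "O'brien" would still stay distinct in this sketch.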
data/lib/name.rb
CHANGED
@@ -3,13 +3,20 @@
 class Name
   attr_accessor :male_count, :female_count, :name
 
-  def initialize(name)
+  def initialize(name, options={})
+    # default_options = {
+    #   :significance_level => 0.95,
+    # }
+    @options = options #default_options.merge(options)
+
     @male_count = 0
     @female_count = 0
     @name = name
+    # @significance_level = @options[:significance_level]
   end
 
-  def male?
+  def male?
+    #pure proportions, so even the slightest greater proportion of one gender will affect this
     @male_count > @female_count
   end
 
@@ -18,7 +25,13 @@ class Name
   end
 
   def gender
-
+    if female?
+      :female
+    elsif male?
+      :male
+    else
+      :unknown
+    end
   end
 
   def female_proportion
@@ -35,12 +48,48 @@ class Name
     (@male_count + @female_count).to_f
   end
 
-
-  #
+
+  # fancy statistics!
+  #------------------
+  # implements Agresti-Coull estimated value and binomial confidence interval
+  # via:
+  # - http://codesequoia.wordpress.com/2010/12/06/unit-test-and-statistics/
+  # - http://stackoverflow.com/questions/3749125/how-should-i-order-these-helpful-scores/3752941#3752941
+  # -
+  #this is apparently related to alpha and related to the level of statistical significance we care about.
+  # 1.96 pertains to a 0.95 significance level.
+  #
+  MAGIC_STATISTICS_NUMBER = 1.96
+
+  def z
+    # TODO: https://github.com/clbustos/statsample/blob/1168d58b14a5095af0a639b4843b31433d40f105/lib/statsample/srs.rb
+    #@significance_level #do stuff with this.
+    MAGIC_STATISTICS_NUMBER
+  end
+
+  def estimated_female_value
+    estimated_value_formula(@female_count)
+  end
+
+  def estimated_male_value
+    estimated_value_formula(@male_count)
+  end
+
+  def estimated_value
+    estimated_value_formula([@male_count, @female_count].max)
+  end
+
+  # returns lower bound of higher of male/female
   def lower
-    nt = total +
-
-
-
+    nt = total + z ** 2
+    interval = z * Math.sqrt(estimated_value * (1 - estimated_value) / nt)
+    [female_proportion, male_proportion].max - interval
+  end
+
+  private
+
+  def estimated_value_formula(observed)
+    nt = total + z ** 2
+    (observed + ((z ** 2) / 2)) / nt
   end
 end
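The `z` method above returns a hard-coded 1.96 and leaves a TODO for deriving it from a configurable level. One possible approach, offered only as a sketch, is to invert the standard normal distribution numerically with `Math.erf` from Ruby's standard library. The helper name `z_for_confidence` is hypothetical, and it treats the level as a two-sided confidence level (0.95 gives roughly 1.96).

```ruby
# Hypothetical helper: derive a two-sided z-score from a confidence level
# by bisection. Not part of the gem; names and defaults are assumptions.
def z_for_confidence(level, tolerance = 1e-9)
  unless level > 0 && level < 1
    raise ArgumentError, "level must be strictly between 0 and 1"
  end

  low = 0.0
  high = 10.0 # erf(10 / sqrt(2)) rounds to 1.0, so this brackets any valid level
  while high - low > tolerance
    mid = (low + high) / 2.0
    # For a standard normal Z, P(|Z| <= mid) equals erf(mid / sqrt(2)).
    if Math.erf(mid / Math.sqrt(2)) < level
      low = mid
    else
      high = mid
    end
  end
  (low + high) / 2.0
end

puts z_for_confidence(0.95).round(4)
```

Bisection is slower than a closed-form approximation, but it needs no extra dependency and the result can be computed once and cached per level.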
metadata
CHANGED
@@ -1,38 +1,33 @@
 --- !ruby/object:Gem::Specification
 name: beauvoir
 version: !ruby/object:Gem::Version
-  version: 0.0.
+  version: 0.0.2b
 platform: ruby
 authors:
 - Jeremy B. Merrill
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2013-10-
+date: 2013-10-26 00:00:00.000000000 Z
 dependencies: []
-description:
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-  depends hundreds of person-hours of labor. There is a place
-  for more automated counts and Beauvoir can help, but if you
-  do a count like this, you should be careful in how you word
-  your findings not to misgender anyone in particular and be
-  responsive to the possibility of errors.
+description: "Guess gender by a first name using more detailed, better\n sourced
+  data from Open Gender Tracker's Global Name Data.<br />\n Beauvoir
+  lets you set avg and lower bounds and choose\n countries from
+  which to draw data. \n\n\n It's important to note that many people
+  identify as neither\n a men nor a women. It's important, too,
+  to note that many\n people who do identify as male or female
+  have names that<br />\n are held by far more people who identify
+  as another gender.\n All of these people deserve not to be misgendered
+  in public\n (or in private).\n\n Nevertheless,
+  automatically classifying people by apparent\n gender can be
+  a very useful tool to perform censuses of\n communities or publications
+  to detect and quantify\n perhaps-invisible bias. VIDA is a pioneer
+  in this field,\n but their \"Count\" is limited by a manual methodology
+  that\n depends hundreds of person-hours of labor. There is a
+  place\n for more automated counts and Beauvoir can help, but
+  if you\n do a count like this, you should be careful in how you
+  word\n your findings not to misgender anyone in particular and
+  be\n responsive to the possibility of errors."
 email: jeremybmerrill@jeremybmerrill.com
 executables: []
 extensions: []
@@ -42,8 +37,8 @@ files:
 - lib/name.rb
 - LICENSE
 - README.md
-- lib/data/ukprocessed.csv
 - lib/data/usprocessed.csv
+- lib/data/ukprocessed.csv
 homepage: http://rubygems.org/gems/beauvoir
 licenses:
 - MIT
@@ -64,9 +59,8 @@ required_rubygems_version: !ruby/object:Gem::Requirement
       version: 1.3.1
 requirements: []
 rubyforge_project:
-rubygems_version: 2.0.
+rubygems_version: 2.0.3
 signing_key:
 specification_version: 4
 summary: Guess a person's gender by their first name
 test_files: []
-has_rdoc: