static_genderizer 0.1.0

checksums.yaml ADDED
@@ -0,0 +1,7 @@
+ ---
+ SHA256:
+   metadata.gz: 95d856ce74201266c3d0e301c7f70eed4be07acee1d50689a536c8ac54549e6d
+   data.tar.gz: c7b7aeb2debfab6819bfea6a4d601e64e5fcf1231ec074e63ff4b7c9f44168dd
+ SHA512:
+   metadata.gz: bebb76cdfc0af9e65466e53e790ccf8785c4fdf5c3214d4af1a2a67c5a2b1bcae76dae6ac04ee80434fef174d19813b809ebf2c9269cf98c14eb22bd3a0dcfa5
+   data.tar.gz: 2047928c2fb2d9a7214b52c3d45526907ee39e16afcc715adb14bc6974e720071f9be3ab0690d019d815f878767ae58a6bddf692015e25bfbbe570084096af7c
data/LICENSE ADDED
@@ -0,0 +1,21 @@
+ MIT License
+
+ Copyright (c) 2025 PadawanBreslau
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy
+ of this software and associated documentation files (the "Software"), to deal
+ in the Software without restriction, including without limitation the rights
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ copies of the Software, and to permit persons to whom the Software is
+ furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in
+ all copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+ THE SOFTWARE.
data/README.md ADDED
@@ -0,0 +1,111 @@
+ # static_genderizer
+
+ static_genderizer is a small Ruby library that detects probable gender and splits first/last name tokens using static CSV datasets loaded at startup. It's intentionally simple and fast — it uses in-memory lookups from CSV files that you provide.
+
+ Key features
+ - Load static per-language CSV files (name,gender) from a configurable data directory.
+ - Classify tokens: tokens that have a gender value (M/F) are treated as first names; tokens without gender are treated as last names.
+ - Analyze a name string and return an object with:
+   - first_names: Array of detected first-name tokens (preserve original token casing)
+   - last_names: Array of detected last-name tokens
+   - language: the language chosen for the result (symbol) or nil
+   - gender: one of :male, :female or :unknown
+ - Configure the set of languages to load; if no language is requested, the gem searches across all loaded languages and selects the best match.
+
+ Installation
+ Add this gem to your Gemfile (local development):
+ ```ruby
+ # Gemfile
+ gem "static_genderizer"
+ ```
+
+ Or build and install the gem locally:
+ ```bash
+ gem build static_genderizer.gemspec
+ gem install ./static_genderizer-0.1.0.gem
+ ```
+
+ Configuration & CSV files
+ - Provide CSV files named `xx.csv` (where `xx` is the language code, e.g. `pl`, `en`) in a data directory.
+ - CSV format: the file must have headers and at least two columns: `name` and `gender`.
+   - `name`: the name token (e.g. "Anna", "Kowalski")
+   - `gender`: `M`, `F` (case-insensitive) or empty
+     - non-empty gender => token will be treated as a first name (and the gender recorded)
+     - empty gender => token will be treated as a last name
+ - Example `spec/data/pl.csv`:
+
+ ```csv
+ name,gender
+ Jan,M
+ Anna,F
+ ```
+
+ Quickstart (programmatic)
+ ```ruby
+ require "static_genderizer"
+
+ # configure and load CSVs
+ StaticGenderizer.configure do |c|
+   c.data_path = File.expand_path("data", __dir__) # path containing pl.csv, en.csv ...
+   c.languages = [:pl, :en] # languages to load
+   c.case_sensitive = false
+ end
+
+ # analyze a name (language optional)
+ result = StaticGenderizer.analyze("Jan Kowalski", language: :pl)
+
+ puts result.first_names.inspect # => ["Jan"]
+ puts result.last_names.inspect # => ["Kowalski"]
+ puts result.language # => :pl
+ puts result.gender # => :male
+ ```
+
+ Behavior & heuristics
+ - Tokenization: input string is split on whitespace and punctuation (commas/semicolons). Apostrophes are preserved (e.g., "O'Connor").
+ - Classification:
+   - If a token is present in a language's `first_names` map (i.e. found in CSV with non-empty gender) — it is treated as a first name for that language.
+   - Otherwise the token is treated as a last name.
+ - Language selection:
+   - If you pass a `language:` that is loaded, analysis is done only for that language.
+   - If you do not pass a language, the gem analyzes across all configured languages and picks the language that produced the highest "match" score (based on tokens found).
+ - Gender decision:
+   - Derived from detected first-name tokens' recorded genders (M/F). Majority rule applies; ties or no matches => `:unknown`.
+
+ API
+ - StaticGenderizer.configure { |c| ... } — configure data_path, languages, case_sensitive and load CSVs.
+ - StaticGenderizer.analyze(name_string, language: nil) => returns a StaticGenderizer::Result
+   - Result provides `first_names`, `last_names`, `language`, and `gender`.
+   - Example: `StaticGenderizer.analyze("Anna Nowak")`
+
+ Testing
+ Specs are included under `spec/`. To run:
+ ```bash
+ bundle install
+ bundle exec rspec
+ ```
+
+ Project layout (relevant)
+ - lib/static_genderizer/*.rb — gem implementation
+ - spec/ — RSpec tests and sample CSV data (spec/data/*.csv)
+ - static_genderizer.gemspec, Gemfile, Rakefile
+
+ Notes, limitations & next steps
+ - CSV-only: this gem is designed for static CSV lookup. It does not use external services.
+ - Declension: current implementation focuses on name splitting and gender detection using `xx.csv` files. If you need declension/inflection data (e.g., `xx_declination.csv`) or morphological support, this can be added — the codebase is structured so a `declinations` loader and inclusion in Result can be implemented.
+ - Ambiguity & heuristics: heuristics are intentionally simple. For better accuracy you can:
+   - extend CSVs with frequency scores,
+   - add a `type` column (`first`, `last`, `both`),
+   - add language-specific rules or suffix heuristics.
+ - Case sensitivity is configurable via `configuration.case_sensitive`.
+
+ Contributing
+ - Add CSV data under `spec/data` for tests or in your own `data/` folder when using the gem.
+ - Add specs under `spec/` and run `bundle exec rspec`.
+ - Open pull requests for bug fixes or improvements.
+
+ License
+ MIT — see LICENSE file.
+
+ Contact / support
+ Open an issue or PR in the repository where this gem is hosted.
+
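The tokenization, classification, and majority-vote rules listed under "Behavior & heuristics" in the README can be sketched in a few lines. This is a minimal standalone illustration, not the gem's code: the `FIRST_NAME_GENDERS` table and the `sketch_tokenize` / `sketch_analyze` helpers are hypothetical names introduced here, and the sketch always lowercases tokens (the equivalent of `case_sensitive = false`).

```ruby
# Hypothetical lookup table standing in for one loaded language (assumption).
FIRST_NAME_GENDERS = { "jan" => "M", "anna" => "F" }.freeze

# Split on whitespace, commas and semicolons; apostrophes stay inside tokens.
def sketch_tokenize(name)
  name.to_s.strip.split(/[\s,;]+/).reject(&:empty?)
end

# Tokens found in the table are first names; everything else is a last name.
# Gender is a majority vote over the recorded M/F values; ties give :unknown.
def sketch_analyze(name, table = FIRST_NAME_GENDERS)
  firsts, lasts = sketch_tokenize(name).partition { |tok| table.key?(tok.downcase) }
  genders = firsts.map { |tok| table[tok.downcase] }
  males = genders.count("M")
  females = genders.count("F")
  gender =
    if males > females
      :male
    elsif females > males
      :female
    else
      :unknown
    end
  { first_names: firsts, last_names: lasts, gender: gender }
end

p sketch_analyze("Kowalski, Jan")
p sketch_analyze("Jan Anna O'Connor")
```

In the first call "Jan" is the only first name, so the gender is `:male`. In the second call one male and one female first name tie, so the gender is `:unknown` and "O'Connor" is kept whole as a last name.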
data/lib/static_genderizer/configuration.rb ADDED
@@ -0,0 +1,17 @@
+ # frozen_string_literal: true
+
+ module StaticGenderizer
+   # Holds configuration for the gem.
+   class Configuration
+     attr_accessor :data_path, :languages, :case_sensitive
+
+     def initialize
+       # folder where CSVs will be looked up (must be set by user typically)
+       @data_path = File.join(Dir.pwd, "data")
+       # languages to load as array of symbols/strings; default empty (no default language)
+       @languages = []
+       # token matching case-sensitivity
+       @case_sensitive = false
+     end
+   end
+ end
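The `case_sensitive` flag controls how tokens are normalized before lookup. The sketch below shows the effect, assuming the same strip-then-downcase rule that the loader and genderizer apply; `sketch_normalize` is a hypothetical helper name, not part of the gem.

```ruby
# Strip surrounding whitespace, and lowercase unless matching is case-sensitive.
# String#downcase handles non-ASCII letters such as "Ł" and "Ż" on Ruby 2.4+.
def sketch_normalize(token, case_sensitive: false)
  stripped = token.to_s.strip
  case_sensitive ? stripped : stripped.downcase
end

puts sketch_normalize(" BŁAŻEJ ")
puts sketch_normalize("Jan", case_sensitive: true)
```

With `case_sensitive = true` the CSV entries are stored exactly as written, so an input of "jan" would presumably not match a CSV row "Jan".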
data/lib/static_genderizer/genderizer.rb ADDED
@@ -0,0 +1,125 @@
+ # frozen_string_literal: true
+
+ require "set"
+
+ module StaticGenderizer
+   # Analyze a name string and decide first/last tokens and gender.
+   #
+   # Rules implemented:
+   # - Tokenize input into tokens
+   # - For the requested search set of languages:
+   #   * If a token exists in a language's first_names table => treat token as first name for that language
+   #   * Otherwise => treat token as last name for that language
+   # - To produce a single language in the result: pick the language with the most token matches
+   # - Gender is decided from the detected first-name tokens across the chosen language: majority rule from recorded genders (M/F), otherwise :unknown
+   #
+   class Genderizer
+     attr_reader :loader
+
+     def initialize(loader)
+       @loader = loader
+     end
+
+     # name_string: String
+     # language: optional symbol/string. If provided and is among loaded languages -> analyze only that language.
+     # otherwise analyze across all loaded languages.
+     def analyze(name_string, language: nil)
+       tokens = tokenize(name_string)
+       return Result.new(first_names: [], last_names: [], language: nil, gender: :unknown) if tokens.empty?
+
+       # Determine languages to search
+       loaded = loader.languages
+       requested = language ? language.to_s.downcase.to_sym : nil
+       search_langs =
+         if requested && loaded.include?(requested)
+           [requested]
+         else
+           loaded
+         end
+
+       # For each language compute classification and matches count
+       lang_results = {}
+       search_langs.each do |lang|
+         firsts = []
+         lasts = []
+         matches = 0
+
+         tokens.each do |tok|
+           nt = normalize_token(tok)
+           if loader.first_names[lang].key?(nt)
+             firsts << tok
+             matches += 1
+           else
+             # everything that isn't a known first name becomes a last name
+             lasts << tok
+             # count last-name matches if token exists in last_names (helps scoring)
+             matches += 1 if loader.last_names[lang].include?(nt)
+           end
+         end
+
+         lang_results[lang] = { firsts: firsts, lasts: lasts, matches: matches }
+       end
+
+       # Choose best language by matches (highest). If tie, prefer the requested language if present.
+       best_lang, best_data = nil, nil
+       lang_results.each do |lang, data|
+         if best_lang.nil? || data[:matches] > best_data[:matches] ||
+            (data[:matches] == best_data[:matches] && language && lang == requested)
+           best_lang = lang
+           best_data = data
+         end
+       end
+
+       # best_lang is nil only when no languages are loaded. There is nothing to classify
+       # against, so every token is returned as a last name and the gender stays unknown.
+       if best_lang.nil?
+         return Result.new(first_names: [], last_names: tokens, language: nil, gender: :unknown)
+       end
+
+       # Decide gender from first_names in chosen language
+       genders = []
+       best_data[:firsts].each do |fn|
+         nt = normalize_token(fn)
+         recorded = loader.first_names[best_lang][nt]
+         genders.concat(Array(recorded)) if recorded
+       end
+       gender = decide_gender_from_list(genders)
+
+       Result.new(first_names: best_data[:firsts], last_names: best_data[:lasts], language: best_lang, gender: gender)
+     end
+
+     private
+
+     def tokenize(name_string)
+       return [] if name_string.nil?
+       # Split on whitespace and commas/semicolons. Keep apostrophes in tokens.
+       name_string.strip.split(/[\s,;]+/).map(&:strip).reject(&:empty?)
+     end
+
+     def normalize_token(token)
+       t = token.to_s.strip
+       t = t.downcase unless loader.config.case_sensitive
+       t
+     end
+
+     def decide_gender_from_list(genders)
+       return :unknown if genders.empty?
+
+       counts = Hash.new(0)
+       genders.each do |g|
+         g_u = g.to_s.upcase
+         counts[g_u] += 1
+       end
+
+       m = counts["M"]
+       f = counts["F"]
+       if m > f
+         :male
+       elsif f > m
+         :female
+       else
+         :unknown
+       end
+     end
+   end
+ end
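The language-selection step in `analyze` scores each language by how many tokens appear in its first-name or last-name table and keeps the highest score. The sketch below isolates that scoring with hypothetical in-memory tables (`SKETCH_TABLES`, `sketch_score`, and `sketch_best_language` are names assumed for illustration). It uses a strict `>` comparison, so on a tie the language that was searched first wins.

```ruby
require "set"

# Hypothetical per-language tables; insertion order stands in for load order.
SKETCH_TABLES = {
  pl: { first: { "jan" => ["M"], "anna" => ["F"] }, last: Set.new },
  en: { first: { "john" => ["M"], "mary" => ["F"] }, last: Set["smith", "o'connor"] }
}.freeze

# Number of tokens known to a language, either as a first or a last name.
def sketch_score(tokens, table)
  tokens.count do |tok|
    key = tok.downcase
    table[:first].key?(key) || table[:last].include?(key)
  end
end

# Keep the first language with the strictly highest score.
def sketch_best_language(name, tables = SKETCH_TABLES)
  tokens = name.to_s.split(/[\s,;]+/).reject(&:empty?)
  best_lang = nil
  best_score = -1
  tables.each do |lang, table|
    score = sketch_score(tokens, table)
    if score > best_score
      best_lang = lang
      best_score = score
    end
  end
  best_lang
end

p sketch_best_language("Mary Smith")
p sketch_best_language("Jan Smith")
```

"Mary Smith" scores 2 for `:en` and 0 for `:pl`, so `:en` is chosen. "Jan Smith" scores 1 in each language, and the tie resolves to `:pl` because it comes first. A name with no matches at all also falls back to the first language, which is worth keeping in mind when reading the `language` field of a result.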
data/lib/static_genderizer/loader.rb ADDED
@@ -0,0 +1,75 @@
+ # frozen_string_literal: true
+
+ require "set"
+ require "csv"
+
+ module StaticGenderizer
+   # Loads per-language CSVs into in-memory lookup tables.
+   #
+   # Expected CSV format: headers name,gender
+   # - if gender is non-empty -> treat as first name entry ('M'/'F' recorded case-insensitively, any other value as 'U')
+   # - if gender empty -> treat as last name entry
+   #
+   class Loader
+     attr_reader :config, :languages, :first_names, :last_names
+
+     def initialize(config)
+       @config = config
+       @languages = []
+       # first_names[lang] => { normalized_name => [genders...] }
+       @first_names = Hash.new { |h, k| h[k] = {} }
+       # last_names[lang] => Set of normalized last names
+       @last_names = Hash.new { |h, k| h[k] = Set.new }
+     end
+
+     # Load all configured languages (config.languages). Normalizes languages to symbols.
+     def load_all_languages
+       @languages = Array(config.languages).map { |l| l.to_s.downcase.to_sym }
+       @languages.each do |lang|
+         load_language(lang)
+       end
+     end
+
+     # Loads a specific language (by symbol or string)
+     def load_language(lang)
+       lang = lang.to_s.downcase.to_sym
+       return unless Array(config.languages).map { |l| l.to_s.downcase.to_sym }.include?(lang)
+
+       base = File.join(config.data_path, lang.to_s)
+       csv_file = base + ".csv"
+
+       if File.exist?(csv_file)
+         load_names_csv(lang, csv_file)
+       else
+         warn "StaticGenderizer: CSV not found for language #{lang} at #{csv_file}"
+       end
+     end
+
+     private
+
+     def normalize_token(token)
+       t = token.to_s.strip
+       t = t.downcase unless config.case_sensitive
+       t
+     end
+
+     def load_names_csv(lang, path)
+       CSV.foreach(path, headers: true) do |row|
+         name = row['name'].to_s.strip
+         next if name.empty?
+         gender_raw = row['gender'].to_s.strip
+
+         token = normalize_token(name)
+         if gender_raw.empty?
+           last_names[lang] << token
+         else
+           g = gender_raw[0].upcase
+           g = "U" unless %w[M F].include?(g)
+           (first_names[lang][token] ||= []) << g
+         end
+       end
+     rescue CSV::MalformedCSVError => e
+       warn "StaticGenderizer: malformed CSV #{path}: #{e.message}"
+     end
+   end
+ end
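The loader's row handling can be tried without files on disk by parsing an inline CSV string. The sketch below is a simplified standalone version under stated assumptions: `SKETCH_CSV` and `sketch_load_names` are hypothetical names, the sample rows are illustrative, and tokens are always lowercased.

```ruby
require "csv"
require "set"

# Illustrative sample data (assumption); note the lowercase "f" and the empty genders.
SKETCH_CSV = <<~DATA
  name,gender
  John,M
  Mary,f
  Smith,
  O'Connor,
DATA

# Returns [first_names, last_names]:
#   first_names maps a lowercased name to the list of recorded genders,
#   last_names is a Set of lowercased names whose gender column was empty.
def sketch_load_names(csv_text)
  first_names = {}
  last_names = Set.new
  CSV.parse(csv_text, headers: true).each do |row|
    name = row["name"].to_s.strip
    next if name.empty?

    gender = row["gender"].to_s.strip.upcase
    if gender.empty?
      last_names << name.downcase
    else
      # Anything other than M/F is kept as "U" (unknown) but still marks a first name.
      letter = %w[M F].include?(gender[0]) ? gender[0] : "U"
      (first_names[name.downcase] ||= []) << letter
    end
  end
  [first_names, last_names]
end

firsts, lasts = sketch_load_names(SKETCH_CSV)
p firsts
p lasts
```

An empty unquoted field is parsed as `nil`, which is why the sketch calls `to_s` before checking for emptiness. Note that the `csv` library ships as a bundled gem rather than a default gem on recent Ruby versions, so projects managed with Bundler may need to list it explicitly.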
data/lib/static_genderizer/result.rb ADDED
@@ -0,0 +1,27 @@
+ # frozen_string_literal: true
+
+ module StaticGenderizer
+   # Simple result object returned by analyze
+   class Result
+     attr_reader :first_names, :last_names, :language, :gender
+
+     # first_names, last_names: arrays of original tokens (preserve original casing)
+     # language: symbol or nil (language that best matched)
+     # gender: :male, :female, :unknown
+     def initialize(first_names:, last_names:, language:, gender:)
+       @first_names = Array(first_names)
+       @last_names = Array(last_names)
+       @language = language
+       @gender = gender
+     end
+
+     def to_h
+       {
+         first_names: first_names,
+         last_names: last_names,
+         language: language,
+         gender: gender
+       }
+     end
+   end
+ end
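Because `to_h` returns a plain Hash, a result can be handed to any serializer. The sketch below uses a hypothetical `SketchResult` struct with the same four fields as a stand-in for `StaticGenderizer::Result`, and serializes it with the standard `json` library. Symbol values such as `:pl` come out as JSON strings.

```ruby
require "json"

# Stand-in for StaticGenderizer::Result with the same four fields (assumption).
SketchResult = Struct.new(:first_names, :last_names, :language, :gender, keyword_init: true)

result = SketchResult.new(
  first_names: ["Jan"],
  last_names: ["Kowalski"],
  language: :pl,
  gender: :male
)

# Struct#to_h yields a Hash keyed by field name, like Result#to_h above.
json = JSON.generate(result.to_h)
puts json
```

Parsing the JSON back gives string keys and string values, so a consumer would compare against `"male"` rather than `:male`.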
data/lib/static_genderizer/version.rb ADDED
@@ -0,0 +1,3 @@
+ module StaticGenderizer
+   VERSION = "0.1.0"
+ end
data/lib/static_genderizer.rb ADDED
@@ -0,0 +1,39 @@
+ # frozen_string_literal: true
+
+ require "csv"
+ require "set"
+
+ require_relative "static_genderizer/version"
+ require_relative "static_genderizer/configuration"
+ require_relative "static_genderizer/loader"
+ require_relative "static_genderizer/genderizer"
+ require_relative "static_genderizer/result"
+
+ module StaticGenderizer
+   class << self
+     # Configure and load CSVs
+     def configure
+       yield configuration if block_given?
+       loader.load_all_languages
+     end
+
+     def configuration
+       @configuration ||= Configuration.new
+     end
+
+     def loader
+       @loader ||= Loader.new(configuration)
+     end
+
+     def genderizer
+       @genderizer ||= Genderizer.new(loader)
+     end
+
+     # analyze(name_string, language: nil)
+     # - if language is provided and loaded -> analyze only in that language
+     # - otherwise analyze across all loaded languages
+     def analyze(name_string, language: nil)
+       genderizer.analyze(name_string, language: language)
+     end
+   end
+ end
data/spec/data/en.csv ADDED
@@ -0,0 +1,5 @@
+ name,gender
+ John,M
+ Mary,F
+ Smith,
+ O'Connor,
data/spec/data/pl.csv ADDED
@@ -0,0 +1,6 @@
+ name,gender
+ Andrzej,M
+ Anna,F
+ Błażej,M
+ Czesława,F
+ Jan,M
metadata ADDED
@@ -0,0 +1,96 @@
+ --- !ruby/object:Gem::Specification
+ name: static_genderizer
+ version: !ruby/object:Gem::Version
+   version: 0.1.0
+ platform: ruby
+ authors:
+ - You
+ autorequire:
+ bindir: bin
+ cert_chain: []
+ date: 2025-12-02 00:00:00.000000000 Z
+ dependencies:
+ - !ruby/object:Gem::Dependency
+   name: rake
+   requirement: !ruby/object:Gem::Requirement
+     requirements:
+     - - ">="
+       - !ruby/object:Gem::Version
+         version: '0'
+   type: :development
+   prerelease: false
+   version_requirements: !ruby/object:Gem::Requirement
+     requirements:
+     - - ">="
+       - !ruby/object:Gem::Version
+         version: '0'
+ - !ruby/object:Gem::Dependency
+   name: rspec
+   requirement: !ruby/object:Gem::Requirement
+     requirements:
+     - - ">="
+       - !ruby/object:Gem::Version
+         version: '0'
+   type: :development
+   prerelease: false
+   version_requirements: !ruby/object:Gem::Requirement
+     requirements:
+     - - ">="
+       - !ruby/object:Gem::Version
+         version: '0'
+ - !ruby/object:Gem::Dependency
+   name: pry
+   requirement: !ruby/object:Gem::Requirement
+     requirements:
+     - - "~>"
+       - !ruby/object:Gem::Version
+         version: '0.14'
+   type: :development
+   prerelease: false
+   version_requirements: !ruby/object:Gem::Requirement
+     requirements:
+     - - "~>"
+       - !ruby/object:Gem::Version
+         version: '0.14'
+ description: Detect probable gender and split first/last name tokens using static
+   CSV datasets.
+ email:
+ - you@example.com
+ executables: []
+ extensions: []
+ extra_rdoc_files: []
+ files:
+ - LICENSE
+ - README.md
+ - lib/static_genderizer.rb
+ - lib/static_genderizer/configuration.rb
+ - lib/static_genderizer/genderizer.rb
+ - lib/static_genderizer/loader.rb
+ - lib/static_genderizer/result.rb
+ - lib/static_genderizer/version.rb
+ - spec/data/en.csv
+ - spec/data/pl.csv
+ homepage: ''
+ licenses:
+ - MIT
+ metadata: {}
+ post_install_message:
+ rdoc_options: []
+ require_paths:
+ - lib
+ required_ruby_version: !ruby/object:Gem::Requirement
+   requirements:
+   - - ">="
+     - !ruby/object:Gem::Version
+       version: '2.7'
+ required_rubygems_version: !ruby/object:Gem::Requirement
+   requirements:
+   - - ">="
+     - !ruby/object:Gem::Version
+       version: '0'
+ requirements: []
+ rubygems_version: 3.4.10
+ signing_key:
+ specification_version: 4
+ summary: Name gender detection using static CSV files per-language
+ test_files: []