RubyGems - static_genderizer - Versions diffs - 0.1.0 - Mend

static_genderizer 0.1.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (12)

checksums.yaml +7 -0
data/LICENSE +21 -0
data/README.md +111 -0
data/lib/static_genderizer/configuration.rb +17 -0
data/lib/static_genderizer/genderizer.rb +125 -0
data/lib/static_genderizer/loader.rb +75 -0
data/lib/static_genderizer/result.rb +27 -0
data/lib/static_genderizer/version.rb +3 -0
data/lib/static_genderizer.rb +40 -0
data/spec/data/en.csv +5 -0
data/spec/data/pl.csv +6 -0
metadata +96 -0

checksums.yaml ADDED

@@ -0,0 +1,7 @@
+---
+SHA256:
+  metadata.gz: 95d856ce74201266c3d0e301c7f70eed4be07acee1d50689a536c8ac54549e6d
+  data.tar.gz: c7b7aeb2debfab6819bfea6a4d601e64e5fcf1231ec074e63ff4b7c9f44168dd
+SHA512:
+  metadata.gz: bebb76cdfc0af9e65466e53e790ccf8785c4fdf5c3214d4af1a2a67c5a2b1bcae76dae6ac04ee80434fef174d19813b809ebf2c9269cf98c14eb22bd3a0dcfa5
+  data.tar.gz: 2047928c2fb2d9a7214b52c3d45526907ee39e16afcc715adb14bc6974e720071f9be3ab0690d019d815f878767ae58a6bddf692015e25bfbbe570084096af7c
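The digests above can be checked against a downloaded archive. The following is a minimal sketch, assuming the archive has already been extracted to a local path; `sha256_matches?` is a hypothetical helper name, and the demo hashes a temporary file containing "abc" rather than the real `data.tar.gz`.

```ruby
require "digest"
require "tempfile"

# Hypothetical helper: true when the file's SHA256 hex digest equals the
# expected value (compared case-insensitively).
def sha256_matches?(path, expected_hex)
  Digest::SHA256.file(path).hexdigest == expected_hex.downcase
end

# Demo on a throwaway file; for a real check, pass the path of data.tar.gz
# and the SHA256 value listed in checksums.yaml.
Tempfile.create("checksum_demo") do |f|
  f.write("abc")
  f.flush
  puts sha256_matches?(f.path, "ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad")
  # prints: true
end
```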

data/LICENSE ADDED

@@ -0,0 +1,21 @@
+MIT License
+Copyright (c) 2025 PadawanBreslau
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+The above copyright notice and this permission notice shall be included in
+all copies or substantial portions of the Software.
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+THE SOFTWARE.

data/README.md ADDED

@@ -0,0 +1,111 @@
+# static_genderizer
+static_genderizer is a small Ruby library that detects probable gender and splits first/last name tokens using static CSV datasets loaded at startup. It's intentionally simple and fast — it uses in-memory lookups from CSV files that you provide.
+Key features
+- Load static per-language CSV files (name,gender) from a configurable data directory.
+- Classify tokens: tokens that have a gender value (M/F) are treated as first names; tokens without gender are treated as last names.
+- Analyze a name string and return an object with:
+  - first_names: Array of detected first-name tokens (preserve original token casing)
+  - last_names: Array of detected last-name tokens
+  - language: the language chosen for the result (symbol) or nil
+  - gender: one of :male, :female or :unknown
+- Configure the set of languages to load; if no language is requested, the gem searches across all loaded languages and selects the best match.
+Installation
+Add this gem to your Gemfile (local development):
+```ruby
+# Gemfile
+gem "static_genderizer"
+```
+Or build and install the gem locally:
+```bash
+gem build static_genderizer.gemspec
+gem install ./static_genderizer-0.1.0.gem
+```
+Configuration & CSV files
+- Provide CSV files named `xx.csv` (where `xx` is the language code, e.g. `pl`, `en`) in a data directory.
+- CSV format: the file must have headers and at least two columns: `name` and `gender`.
+  - `name`: the name token (e.g. "Anna", "Kowalski")
+  - `gender`: `M`, `F` (case-insensitive) or empty
+    - non-empty gender => token will be treated as a first name (and the gender recorded)
+    - empty gender => token will be treated as a last name
+- Example `spec/data/pl.csv`:
+```csv
+name,gender
+Jan,M
+Anna,F
+```
+Quickstart (programmatic)
+```ruby
+require "static_genderizer"
+# configure and load CSVs
+StaticGenderizer.configure do |c|
+  c.data_path = File.expand_path("data", __dir__) # path containing pl.csv, en.csv ...
+  c.languages = [:pl, :en]                              # languages to load
+  c.case_sensitive = false
+end
+# analyze a name (language optional)
+result = StaticGenderizer.analyze("Jan Kowalski", language: :pl)
+puts result.first_names.inspect  # => ["Jan"]
+puts result.last_names.inspect   # => ["Kowalski"]
+puts result.language             # => :pl
+puts result.gender               # => :male
+```
+Behavior & heuristics
+- Tokenization: input string is split on whitespace and punctuation (commas/semicolons). Apostrophes are preserved (e.g., "O'Connor").
+- Classification:
+  - If a token is present in a language's `first_names` map (i.e. found in CSV with non-empty gender) — it is treated as a first name for that language.
+  - Otherwise the token is treated as a last name.
+- Language selection:
+  - If you pass a `language:` that is loaded, analysis is done only for that language.
+  - If you do not pass a language (or pass one that is not loaded), the gem analyzes across all configured languages and picks the one with the highest match count (tokens found as first or last names); ties go to the language listed first in `languages`.
+- Gender decision:
+  - Derived from detected first-name tokens' recorded genders (M/F). Majority rule applies; ties or no matches => `:unknown`.
+API
+- StaticGenderizer.configure { |c| ... } — configure data_path, languages, case_sensitive and load CSVs.
+- StaticGenderizer.analyze(name_string, language: nil) => returns a StaticGenderizer::Result
+  - Result provides `first_names`, `last_names`, `language`, and `gender`.
+  - Example: `StaticGenderizer.analyze("Anna Nowak")`
+Testing
+Specs are included under `spec/`. To run:
+```bash
+bundle install
+bundle exec rspec
+```
+Project layout (relevant)
+- lib/static_genderizer/*.rb — gem implementation
+- spec/ — RSpec tests and sample CSV data (spec/data/*.csv)
+- static_genderizer.gemspec, Gemfile, Rakefile
+Notes, limitations & next steps
+- CSV-only: this gem is designed for static CSV lookup. It does not use external services.
+- Declension: current implementation focuses on name splitting and gender detection using `xx.csv` files. If you need declension/inflection data (e.g., `xx_declination.csv`) or morphological support, this can be added — the codebase is structured so a `declinations` loader and inclusion in Result can be implemented.
+- Ambiguity & heuristics: heuristics are intentionally simple. For better accuracy you can:
+  - extend CSVs with frequency scores,
+  - add a `type` column (`first`, `last`, `both`),
+  - add language-specific rules or suffix heuristics.
+- Case sensitivity is configurable via `configuration.case_sensitive`.
+Contributing
+- Add CSV data under `spec/data` for tests or in your own `data/` folder when using the gem.
+- Add specs under `spec/` and run `bundle exec rspec`.
+- Open pull requests for bug fixes or improvements.
+License
+MIT — see LICENSE file.
+Contact / support
+Open an issue or PR in the repository where this gem is hosted.
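The tokenization and majority-vote rules from the README's "Behavior & heuristics" section can be tried in isolation. This is a standalone sketch that does not load the gem; `tokenize` and `majority_gender` are hypothetical helper names chosen for illustration, and `tally` assumes Ruby 2.7 or newer.

```ruby
# Split on runs of whitespace, commas and semicolons; apostrophes stay in tokens.
def tokenize(name_string)
  return [] if name_string.nil?

  name_string.strip.split(/[\s,;]+/).reject(&:empty?)
end

# Majority rule over recorded "M"/"F" values; ties and empty input give :unknown.
def majority_gender(genders)
  counts = genders.map { |g| g.to_s.upcase }.tally
  m = counts.fetch("M", 0)
  f = counts.fetch("F", 0)
  return :male if m > f
  return :female if f > m

  :unknown
end

p tokenize("O'Connor, Mary; Ann") # => ["O'Connor", "Mary", "Ann"]
p majority_gender(%w[F F M])      # => :female
p majority_gender(%w[M F])        # => :unknown
```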

data/lib/static_genderizer/configuration.rb ADDED

@@ -0,0 +1,17 @@
+# frozen_string_literal: true
+module StaticGenderizer
+  # Holds configuration for the gem.
+  class Configuration
+    attr_accessor :data_path, :languages, :case_sensitive
+    def initialize
+      # folder where CSVs will be looked up (must be set by user typically)
+      @data_path = File.join(Dir.pwd, "data")
+      # languages to load as array of symbols/strings; default empty (no default language)
+      @languages = []
+      # token matching case-sensitivity
+      @case_sensitive = false
+    end
+  end
+end

data/lib/static_genderizer/genderizer.rb ADDED

@@ -0,0 +1,125 @@
+# frozen_string_literal: true
+require "set"
+module StaticGenderizer
+  # Analyze a name string and decide first/last tokens and gender.
+  #
+  # Rules implemented:
+  # - Tokenize input into tokens
+  # - For the requested search set of languages:
+  #     * If a token exists in a language's first_names table => treat token as first name for that language
+  #     * Otherwise => treat token as last name for that language
+  # - To produce a single language in the result: pick the language with the most token matches
+  # - Gender is decided from the detected first-name tokens across the chosen language: majority rule from recorded genders (M/F), otherwise :unknown
+  #
+  class Genderizer
+    attr_reader :loader
+    def initialize(loader)
+      @loader = loader
+    end
+    # name_string: String
+    # language: optional symbol/string. If provided and is among loaded languages -> analyze only that language.
+    #           otherwise analyze across all loaded languages.
+    def analyze(name_string, language: nil)
+      tokens = tokenize(name_string)
+      return Result.new(first_names: [], last_names: [], language: nil, gender: :unknown) if tokens.empty?
+      # Determine languages to search
+      loaded = loader.languages
+      requested = language ? language.to_s.downcase.to_sym : nil
+      search_langs =
+        if requested && loaded.include?(requested)
+          [requested]
+        else
+          loaded
+        end
+      # For each language compute classification and matches count
+      lang_results = {}
+      search_langs.each do |lang|
+        firsts = []
+        lasts = []
+        matches = 0
+        tokens.each do |tok|
+          nt = normalize_token(tok)
+          if loader.first_names[lang].key?(nt)
+            firsts << tok
+            matches += 1
+          else
+            # everything that isn't a known first name becomes a last name
+            lasts << tok
+            # count last-name matches if token exists in last_names (helps scoring)
+            matches += 1 if loader.last_names[lang].include?(nt)
+          end
+        end
+        lang_results[lang] = { firsts: firsts, lasts: lasts, matches: matches }
+      end
+      # Choose best language by matches (highest). If tie, prefer the requested language if present.
+      best_lang, best_data = nil, nil
+      lang_results.each do |lang, data|
+        if best_lang.nil? || data[:matches] > best_data[:matches] ||
+           (data[:matches] == best_data[:matches] && language && lang == requested)
+          best_lang = lang
+          best_data = data
+        end
+      end
+      # With no loaded languages there is nothing to match against, so every
+      # token is returned as a last name with no language and unknown gender.
+      if best_lang.nil?
+        return Result.new(first_names: [], last_names: tokens, language: nil, gender: :unknown)
+      end
+      # Decide gender from first_names in chosen language
+      genders = []
+      best_data[:firsts].each do |fn|
+        nt = normalize_token(fn)
+        recorded = loader.first_names[best_lang][nt]
+        genders.concat(Array(recorded)) if recorded
+      end
+      gender = decide_gender_from_list(genders)
+      Result.new(first_names: best_data[:firsts], last_names: best_data[:lasts], language: best_lang, gender: gender)
+    end
+    private
+    def tokenize(name_string)
+      return [] if name_string.nil?
+      # Split on whitespace and commas/semicolons. Keep apostrophes in tokens.
+      name_string.strip.split(/[\s,;]+/).map(&:strip).reject(&:empty?)
+    end
+    def normalize_token(token)
+      t = token.to_s.strip
+      t = t.downcase unless loader.config.case_sensitive
+      t
+    end
+    def decide_gender_from_list(genders)
+      return :unknown if genders.empty?
+      counts = Hash.new(0)
+      genders.each do |g|
+        g_u = g.to_s.upcase
+        counts[g_u] += 1
+      end
+      m = counts["M"]
+      f = counts["F"]
+      if m > f
+        :male
+      elsif f > m
+        :female
+      else
+        :unknown
+      end
+    end
+  end
+end
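The language-selection loop in `analyze` can be reduced to a few lines to show how ties behave. This is a sketch under stated assumptions: `pick_language` is a hypothetical name, and `scores` stands in for the per-language match counts in the order the languages were searched.

```ruby
# Return the language with the highest match count. The comparison is a
# strict ">", so on a tie the language seen first is kept.
def pick_language(scores)
  best_lang = nil
  best_score = nil
  scores.each do |lang, score|
    next unless best_lang.nil? || score > best_score

    best_lang = lang
    best_score = score
  end
  best_lang
end

p pick_language({ pl: 1, en: 2 }) # => :en
p pick_language({ pl: 1, en: 1 }) # => :pl
p pick_language({ pl: 0, en: 0 }) # => :pl
p pick_language({})               # => nil
```

An empty set of languages yields `nil`, so a caller needs to handle that case before reading any per-language data.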

data/lib/static_genderizer/loader.rb ADDED

@@ -0,0 +1,75 @@
+# frozen_string_literal: true
+require "set"
+require "csv"
+module StaticGenderizer
+  # Loads per-language CSVs into in-memory lookup tables.
+  #
+  # Expected CSV format: headers name,gender
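+  # - if gender starts with 'M' or 'F' (case-insensitive) -> treat as first name entry with that gender
+  # - any other non-empty gender -> still a first name entry, recorded as 'U' (ignored by the majority vote)
+  # - if gender empty -> treat as last name entry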
+  #
+  class Loader
+    attr_reader :config, :languages, :first_names, :last_names
+    def initialize(config)
+      @config = config
+      @languages = []
+      # first_names[lang] => { normalized_name => [genders...] }
+      @first_names = Hash.new { |h, k| h[k] = {} }
+      # last_names[lang] => Set of normalized last names
+      @last_names = Hash.new { |h, k| h[k] = Set.new }
+    end
+    # Load all configured languages (config.languages). Normalizes languages to symbols.
+    def load_all_languages
+      @languages = Array(config.languages).map { |l| l.to_s.downcase.to_sym }
+      @languages.each do |lang|
+        load_language(lang)
+      end
+    end
+    # Loads a specific language (by symbol or string)
+    def load_language(lang)
+      lang = lang.to_s.downcase.to_sym
+      return unless Array(config.languages).map { |l| l.to_s.downcase.to_sym }.include?(lang)
+      base = File.join(config.data_path, lang.to_s)
+      csv_file = base + ".csv"
+      if File.exist?(csv_file)
+        load_names_csv(lang, csv_file)
+      else
+        warn "StaticGenderizer: CSV not found for language #{lang} at #{csv_file}"
+      end
+    end
+    private
+    def normalize_token(token)
+      t = token.to_s.strip
+      t = t.downcase unless config.case_sensitive
+      t
+    end
+    def load_names_csv(lang, path)
+      CSV.foreach(path, headers: true) do |row|
+        name = row["name"].to_s.strip
+        next if name.empty?
+        gender_raw = row["gender"].to_s.strip
+        token = normalize_token(name)
+        if gender_raw.empty?
+          last_names[lang] << token
+        else
+          # Only the first character is significant; anything but M/F is recorded as "U".
+          g = gender_raw[0].upcase
+          g = "U" unless %w[M F].include?(g)
+          (first_names[lang][token] ||= []) << g
+        end
+      end
+      end
+    rescue CSV::MalformedCSVError => e
+      warn "StaticGenderizer: malformed CSV #{path}: #{e.message}"
+    end
+  end
+end
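The row classification performed by the loader can be reproduced on inline data. The sketch below is standalone and assumes case-insensitive mode; the rows mirror `spec/data/en.csv`, plus one hypothetical row (`Alex,x`) added to show how an unrecognized gender value is handled.

```ruby
require "csv"
require "set"

csv_text = <<~CSV
  name,gender
  John,M
  Mary,F
  Smith,
  O'Connor,
  Alex,x
CSV

first_names = {}      # normalized name => list of "M" / "F" / "U"
last_names  = Set.new # normalized names whose gender column is empty

CSV.parse(csv_text, headers: true).each do |row|
  name = row["name"].to_s.strip
  next if name.empty?

  token  = name.downcase # case-insensitive mode
  gender = row["gender"].to_s.strip
  if gender.empty?
    last_names << token
  else
    code = gender[0].upcase
    code = "U" unless %w[M F].include?(code)
    (first_names[token] ||= []) << code
  end
end

p first_names["john"]             # => ["M"]
p first_names["alex"]             # => ["U"]
p last_names.include?("o'connor") # => true
```

A name with an unrecognized gender still lands in the first-name table, so it is never reported as a last name even though it contributes nothing to the gender vote.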

data/lib/static_genderizer/result.rb ADDED

@@ -0,0 +1,27 @@
+# frozen_string_literal: true
+module StaticGenderizer
+  # Simple result object returned by analyze
+  class Result
+    attr_reader :first_names, :last_names, :language, :gender
+    # first_names, last_names: arrays of original tokens (preserve original casing)
+    # language: symbol or nil (language that best matched)
+    # gender: :male, :female, :unknown
+    def initialize(first_names:, last_names:, language:, gender:)
+      @first_names = Array(first_names)
+      @last_names = Array(last_names)
+      @language = language
+      @gender = gender
+    end
+    def to_h
+      {
+        first_names: first_names,
+        last_names: last_names,
+        language: language,
+        gender: gender
+      }
+    end
+  end
+end

data/lib/static_genderizer/version.rb ADDED

@@ -0,0 +1,3 @@
+module StaticGenderizer
+  VERSION = "0.1.0"
+end

data/lib/static_genderizer.rb ADDED

@@ -0,0 +1,40 @@
+# frozen_string_literal: true
+require "csv"
+require "set"
+require_relative "static_genderizer/version"
+require_relative "static_genderizer/configuration"
+require_relative "static_genderizer/loader"
+require_relative "static_genderizer/genderizer"
+require_relative "static_genderizer/result"
+module StaticGenderizer
+  class << self
+    # Configure and load CSVs
+    def configure
+      yield configuration if block_given?
+      loader.load_all_languages
+    end
+    def configuration
+      @configuration ||= Configuration.new
+    end
+    def loader
+      @loader ||= Loader.new(configuration)
+    end
+    def genderizer
+      @genderizer ||= Genderizer.new(loader)
+    end
+    # analyze(name_string, language: nil)
+    # - if language is provided and loaded -> analyze only in that language
+    # - otherwise analyze across all loaded languages
+    def analyze(name_string, language: nil)
+      genderizer.analyze(name_string, language: language)
+    end
+  end
+end

data/spec/data/en.csv ADDED

@@ -0,0 +1,5 @@
+name,gender
+John,M
+Mary,F
+Smith,
+O'Connor,

data/spec/data/pl.csv ADDED

@@ -0,0 +1,6 @@
+name,gender
+Andrzej,M
+Anna,F
+Błażej,M
+Czesława,F
+Jan,M
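Because the Polish sample data contains non-ASCII letters, it may help to see how case-insensitive normalization treats them. This is a standalone sketch; `normalize` is a hypothetical helper that follows the same strip-then-downcase steps as the gem's `normalize_token`, and it relies on `String#downcase` applying Unicode case mapping (Ruby 2.4 and later).

```ruby
# Strip surrounding whitespace and, unless case-sensitive, downcase the token.
def normalize(token, case_sensitive: false)
  t = token.to_s.strip
  case_sensitive ? t : t.downcase
end

puts normalize(" BŁAŻEJ ")                          # prints: błażej
puts normalize("Czesława") == normalize("CZESŁAWA") # prints: true
puts normalize("Jan", case_sensitive: true)         # prints: Jan
```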

metadata ADDED

@@ -0,0 +1,96 @@
+--- !ruby/object:Gem::Specification
+name: static_genderizer
+version: !ruby/object:Gem::Version
+  version: 0.1.0
+platform: ruby
+authors:
+- You
+autorequire:
+bindir: bin
+cert_chain: []
+date: 2025-12-02 00:00:00.000000000 Z
+dependencies:
+- !ruby/object:Gem::Dependency
+  name: rake
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - ">="
+      - !ruby/object:Gem::Version
+        version: '0'
+  type: :development
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - ">="
+      - !ruby/object:Gem::Version
+        version: '0'
+- !ruby/object:Gem::Dependency
+  name: rspec
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - ">="
+      - !ruby/object:Gem::Version
+        version: '0'
+  type: :development
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - ">="
+      - !ruby/object:Gem::Version
+        version: '0'
+- !ruby/object:Gem::Dependency
+  name: pry
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '0.14'
+  type: :development
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '0.14'
+description: Detect probable gender and split first/last name tokens using static
+  CSV datasets.
+email:
+- you@example.com
+executables: []
+extensions: []
+extra_rdoc_files: []
+files:
+- LICENSE
+- README.md
+- lib/static_genderizer.rb
+- lib/static_genderizer/configuration.rb
+- lib/static_genderizer/genderizer.rb
+- lib/static_genderizer/loader.rb
+- lib/static_genderizer/result.rb
+- lib/static_genderizer/version.rb
+- spec/data/en.csv
+- spec/data/pl.csv
+homepage: ''
+licenses:
+- MIT
+metadata: {}
+post_install_message:
+rdoc_options: []
+require_paths:
+- lib
+required_ruby_version: !ruby/object:Gem::Requirement
+  requirements:
+  - - ">="
+    - !ruby/object:Gem::Version
+      version: '2.7'
+required_rubygems_version: !ruby/object:Gem::Requirement
+  requirements:
+  - - ">="
+    - !ruby/object:Gem::Version
+      version: '0'
+requirements: []
+rubygems_version: 3.4.10
+signing_key:
+specification_version: 4
+summary: Name gender detection using static CSV files per-language
+test_files: []