cologne_phonetics 1.0.0

checksums.yaml ADDED
@@ -0,0 +1,7 @@
+ ---
+ SHA1:
+   metadata.gz: 34b708dd44c7f1aea872b47dfa5db5f9d3ad1b91
+   data.tar.gz: 939eaec4a50c1c3e9fff629ee1b52b2f0ca1470d
+ SHA512:
+   metadata.gz: '0948c7a91ec2a0764ea8c06d7a8c9a2c9fab2acebdc5cae1cafdf07283dbd8813114513d94b1743e2c7f0c5a85ca416e7e73234d953707ba30bbada050ed5c52'
+   data.tar.gz: 741a913e92789e62e6d1335f941954f915e2a5fab9155801585d661b68426274492de36af677ca2c3fe7b4a25e64a4174e220f8388b873c48099486c186cdea3
data/CHANGELOG.md ADDED
@@ -0,0 +1,9 @@
+ # Changelog
+
+ All notable changes to this project will be documented in this file.
+
+ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/) and this project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.html).
+
+ ## [1.0.0] – 2018-03-06
+ - Initial release.
+
data/LICENSE.txt ADDED
@@ -0,0 +1,21 @@
+ The MIT License (MIT)
+
+ Copyright (c) 2017 Stefan Daschek
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy
+ of this software and associated documentation files (the "Software"), to deal
+ in the Software without restriction, including without limitation the rights
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ copies of the Software, and to permit persons to whom the Software is
+ furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in
+ all copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+ THE SOFTWARE.
data/README.md ADDED
@@ -0,0 +1,67 @@
+ # ColognePhonetics
+
+ The [“Cologne phonetics (Kölner Phonetik)”](https://en.wikipedia.org/wiki/Cologne_phonetics) algorithm encodes words in a way that makes it possible to search for similar-sounding words. It’s related to the [“Soundex”](https://en.wikipedia.org/wiki/Soundex) algorithm, but better suited to the German language.
+
+ This implementation closely follows the algorithm as described on its Wikipedia page. Support for umlauts (Ä, Ö, Ü) and ß has been added as suggested there.
+
+ Note that *other accented characters are not handled*. If your data may contain such characters, you need to preprocess it (for example by using [`I18n.transliterate`](http://www.rubydoc.info/gems/i18n/I18n/Base#transliterate-instance_method)).
+
+ ## Status
+
+ I consider this gem to be stable and (more or less) finished.
+
+ ## Usage
+
+ Example usage:
+
+ ```ruby
+ ColognePhonetics.encode('Wikipedia') # => "3412"
+
+ # Only basic characters and äöüß are handled, everything else gets ignored:
+ ColognePhonetics.encode('Åè1%-') # => ""
+
+ # If a string contains words separated by spaces, each word is encoded separately:
+ ColognePhonetics.encode('Heinz Classen') # => "068 4586"
+
+ # Use `encode_word` if you want to ignore spaces (note that this usually gives
+ # different results than using `encode` and removing spaces afterwards; see the
+ # Wikipedia article for details):
+ ColognePhonetics.encode_word('Heinz Classen') # => "068586"
+ ```
+
+ You can set `ColognePhonetics.debug = true` to get warnings printed to `$stderr` about characters that cannot be encoded:
+
+ ```ruby
+ ColognePhonetics.debug = true
+ ColognePhonetics.encode('Olé')
+ # Cologne Phonetics: No rule for 'é' (prev: 'l', next: '')
+ # => "05"
+ ```
+
+ ## Installation
+
+ Add this line to your application's Gemfile:
+
+ ```ruby
+ gem 'cologne_phonetics'
+ ```
+
+ And then execute:
+
+     $ bundle
+
+ Or install it yourself as:
+
+     $ gem install cologne_phonetics
+
+ ## Development
+
+ After checking out the repo, run `bin/setup` to install dependencies. You can also run `bin/console` for an interactive prompt that will allow you to experiment.
+
+ ## Contributing
+
+ Bug reports and pull requests are welcome on GitHub at https://github.com/noniq/cologne_phonetics. Please make sure to include tests, and check that running `bin/rubocop` does not show any warnings.
+
+ ## License
+
+ The gem is available as open source under the terms of the [MIT License](http://opensource.org/licenses/MIT).
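The README's advice to preprocess accented input can be sketched without the `i18n` gem, using Ruby's built-in `String#unicode_normalize`. This is only an illustration under stated assumptions: `strip_accents` and `KEPT_CHARACTERS` are hypothetical names that do not exist in the gem, and `I18n.transliterate` may map some characters differently.

```ruby
# Hypothetical helper, not part of the cologne_phonetics gem.
# Characters the encoder already understands and that should pass through unchanged.
KEPT_CHARACTERS = 'äöüÄÖÜß'

def strip_accents(string)
  # Compose first, so that "u" + combining diaeresis is seen as a single "ü".
  string.unicode_normalize(:nfc).each_char.map { |char|
    next char if KEPT_CHARACTERS.include?(char)
    # Decompose into base letter + combining marks, then drop the marks.
    char.unicode_normalize(:nfd).gsub(/\p{Mn}/, '')
  }.join
end

puts strip_accents('Olé')        # prints "Ole"
puts strip_accents('Müller Åè')  # prints "Müller Ae"
```

The result of such a helper could then be passed to `ColognePhonetics.encode`, so that letters like `é` are encoded as vowels rather than ignored.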
data/lib/cologne_phonetics/rules.rb ADDED
@@ -0,0 +1,53 @@
+ # frozen_string_literal: true
+
+ module ColognePhonetics
+   # @api private
+   module Rules
+     def self.define(&block)
+       @rules = DSL.new(&block).rules
+     end
+
+     def self.apply_to(string)
+       string = string.downcase.tr('ÄÖÜ', 'äöü') # Ruby < 2.3 downcases ASCII characters only
+       chars = [nil] + string.chars + [nil]
+       chars.each_cons(3).map{ |prev_char, char, next_char|
+         code_for(prev_char, char, next_char)
+       }.join
+     end
+
+     def self.code_for(prev_char, char, next_char)
+       @rules.each do |matcher, code|
+         return code if matcher.call(prev_char, char, next_char)
+       end
+       debug_info "Cologne Phonetics: No rule for '#{char}' (prev: '#{prev_char}', next: '#{next_char}')"
+       nil
+     end
+
+     def self.debug_info(message)
+       return unless ColognePhonetics.debug
+       $stderr.puts message # rubocop:disable StderrPuts
+     end
+
+     class DSL
+       attr_reader :rules
+
+       def initialize(&block)
+         @rules = []
+         instance_exec(&block)
+       end
+
+       def change(chars, to:, before: nil, not_before: nil, after: nil, not_after: nil, initial: nil)
+         matcher = ->(prev_char, char, next_char){
+           return unless chars.include?(char)
+           return if initial && prev_char
+           return if before && (!next_char || !before.include?(next_char))
+           return if not_before && next_char && not_before.include?(next_char)
+           return if after && (!prev_char || !after.include?(prev_char))
+           return if not_after && prev_char && not_after.include?(prev_char)
+           true
+         }
+         @rules << [matcher, to]
+       end
+     end
+   end
+ end
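The context window that `apply_to` builds can be seen in isolation with a small standalone sketch. `windows_for` and `code_for_p` are hypothetical names used only for illustration; `code_for_p` stands in for the two `p` rules and is not the gem's DSL.

```ruby
# Pad the word with nil on both sides so that the first and last characters
# also get a (prev, char, next) triple, mirroring what apply_to does.
def windows_for(word)
  padded = [nil] + word.downcase.chars + [nil]
  padded.each_cons(3).to_a
end

# Hypothetical stand-in for the two `p` rules:
# "p" becomes "3" before "h" and "1" otherwise; other characters get nil.
def code_for_p(_prev_char, char, next_char)
  return nil unless char == 'p'
  next_char == 'h' ? '3' : '1'
end

p windows_for('Pho')
# prints [[nil, "p", "h"], ["p", "h", "o"], ["h", "o", nil]]
p windows_for('Pho').map { |window| code_for_p(*window) }
# prints ["3", nil, nil]
```

A `nil` neighbour marks the start or end of the word, which is what the `initial:` option in the DSL checks for.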
data/lib/cologne_phonetics/version.rb ADDED
@@ -0,0 +1,5 @@
+ # frozen_string_literal: true
+
+ module ColognePhonetics
+   VERSION = '1.0.0'
+ end
data/lib/cologne_phonetics.rb ADDED
@@ -0,0 +1,58 @@
+ # frozen_string_literal: true
+
+ require 'cologne_phonetics/rules'
+ require 'cologne_phonetics/version'
+
+ module ColognePhonetics
+   class << self
+     # Enable / disable debug mode. If set to true, using {.encode} or {.encode_word} will output
+     # warnings to `$stderr` if they encounter characters that cannot be encoded.
+     attr_accessor :debug
+   end
+
+   # rubocop:disable SpaceBeforeComma
+   Rules.define do
+     change 'aeijouy', to: '0'
+     change 'äöü'    , to: '0' # additional rule: treat umlauts like vowels
+     change 'h'      , to: ''
+     change 'b'      , to: '1'
+     change 'p'      , to: '1', not_before: 'h'
+     change 'dt'     , to: '2', not_before: 'csz'
+     change 'fvw'    , to: '3'
+     change 'p'      , to: '3', before: 'h'
+     change 'gkq'    , to: '4'
+     change 'c'      , to: '4', initial: true, before: 'ahkloqrux'
+     change 'c'      , to: '4', before: 'ahkoqux', not_after: 'sz'
+     change 'x'      , to: '48', not_after: 'ckq'
+     change 'l'      , to: '5'
+     change 'mn'     , to: '6'
+     change 'r'      , to: '7'
+     change 'sz'     , to: '8'
+     change 'ß'      , to: '8' # additional rule: treat 'ß' like 's'
+     change 'c'      , to: '8', after: 'sz'
+     change 'c'      , to: '8', initial: true, not_before: 'ahkloqrux'
+     change 'c'      , to: '8', not_before: 'ahkoqux'
+     change 'dt'     , to: '8', before: 'csz'
+     change 'x'      , to: '8', after: 'ckq'
+   end
+   # rubocop:enable SpaceBeforeComma
+
+   # Encode string using Cologne phonetics rules. The encoding process can handle upper and lower case
+   # characters in the range of `a–z`, as well as `äöüß`. Everything else is ignored.
+   #
+   # If the string consists of several words separated by spaces, each word is encoded separately,
+   # and the resulting codes are then joined together again with spaces.
+   #
+   # @return [String] Encoded string (consists of digits only, and maybe spaces)
+   def self.encode(string)
+     string.split(' ').map{ |word| encode_word(word) }.join(' ')
+   end
+
+   # Low-level method for encoding a single word using Cologne phonetics rules (spaces will be
+   # ignored). You most probably want to use {.encode} instead.
+   #
+   # @return [String] Encoded word (consists of digits only)
+   def self.encode_word(word)
+     Rules.apply_to(word).squeeze.gsub(/(.)0/, '\1')
+   end
+ end
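The clean-up step at the end of `encode_word` can be tried on its own. The sketch below assumes the per-character codes have already been concatenated; `collapse` is a hypothetical name, and the sample inputs were derived by hand from the rule table in the listing.

```ruby
# Hypothetical standalone version of the post-processing in encode_word.
def collapse(raw_code)
  raw_code
    .squeeze               # merge runs of identical digits ("00" -> "0", "88" -> "8")
    .gsub(/(.)0/, '\1')    # drop every "0" that follows another digit (a leading "0" stays)
end

puts collapse('304010200')    # per-character codes for "Wikipedia"; prints "3412"
puts collapse('00688508806')  # per-character codes for "Heinz Classen" as one word; prints "068586"
puts collapse('05')           # a leading "0" is preserved; prints "05"
```

Because `squeeze` runs across the whole string, digits on either side of an ignored space can merge. That is why `encode_word('Heinz Classen')` differs from encoding the two words separately and then removing the space.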
metadata ADDED
@@ -0,0 +1,121 @@
+ --- !ruby/object:Gem::Specification
+ name: cologne_phonetics
+ version: !ruby/object:Gem::Version
+   version: 1.0.0
+ platform: ruby
+ authors:
+ - Stefan Daschek
+ autorequire:
+ bindir: exe
+ cert_chain: []
+ date: 2018-03-06 00:00:00.000000000 Z
+ dependencies:
+ - !ruby/object:Gem::Dependency
+   name: bundler
+   requirement: !ruby/object:Gem::Requirement
+     requirements:
+     - - "~>"
+       - !ruby/object:Gem::Version
+         version: '1.15'
+   type: :development
+   prerelease: false
+   version_requirements: !ruby/object:Gem::Requirement
+     requirements:
+     - - "~>"
+       - !ruby/object:Gem::Version
+         version: '1.15'
+ - !ruby/object:Gem::Dependency
+   name: rake
+   requirement: !ruby/object:Gem::Requirement
+     requirements:
+     - - "~>"
+       - !ruby/object:Gem::Version
+         version: '10.0'
+   type: :development
+   prerelease: false
+   version_requirements: !ruby/object:Gem::Requirement
+     requirements:
+     - - "~>"
+       - !ruby/object:Gem::Version
+         version: '10.0'
+ - !ruby/object:Gem::Dependency
+   name: rspec
+   requirement: !ruby/object:Gem::Requirement
+     requirements:
+     - - "~>"
+       - !ruby/object:Gem::Version
+         version: '3.6'
+   type: :development
+   prerelease: false
+   version_requirements: !ruby/object:Gem::Requirement
+     requirements:
+     - - "~>"
+       - !ruby/object:Gem::Version
+         version: '3.6'
+ - !ruby/object:Gem::Dependency
+   name: rubocop
+   requirement: !ruby/object:Gem::Requirement
+     requirements:
+     - - "~>"
+       - !ruby/object:Gem::Version
+         version: 0.52.1
+   type: :development
+   prerelease: false
+   version_requirements: !ruby/object:Gem::Requirement
+     requirements:
+     - - "~>"
+       - !ruby/object:Gem::Version
+         version: 0.52.1
+ - !ruby/object:Gem::Dependency
+   name: yard
+   requirement: !ruby/object:Gem::Requirement
+     requirements:
+     - - "~>"
+       - !ruby/object:Gem::Version
+         version: 0.9.12
+   type: :development
+   prerelease: false
+   version_requirements: !ruby/object:Gem::Requirement
+     requirements:
+     - - "~>"
+       - !ruby/object:Gem::Version
+         version: 0.9.12
+ description: Cologne phonetics (also Kölner Phonetik, Cologne process) is a phonetic
+   algorithm which assigns to words a sequence of digits, the phonetic code.
+ email:
+ - stefan@die-antwort.eu
+ executables: []
+ extensions: []
+ extra_rdoc_files: []
+ files:
+ - CHANGELOG.md
+ - LICENSE.txt
+ - README.md
+ - lib/cologne_phonetics.rb
+ - lib/cologne_phonetics/rules.rb
+ - lib/cologne_phonetics/version.rb
+ homepage: https://github.com/noniq/cologne_phonetics
+ licenses:
+ - MIT
+ metadata: {}
+ post_install_message:
+ rdoc_options: []
+ require_paths:
+ - lib
+ required_ruby_version: !ruby/object:Gem::Requirement
+   requirements:
+   - - ">="
+     - !ruby/object:Gem::Version
+       version: '0'
+ required_rubygems_version: !ruby/object:Gem::Requirement
+   requirements:
+   - - ">="
+     - !ruby/object:Gem::Version
+       version: '0'
+ requirements: []
+ rubyforge_project:
+ rubygems_version: 2.6.14
+ signing_key:
+ specification_version: 4
+ summary: Cologne phonetics (Kölner Phonetik) text encoding algorithm
+ test_files: []