cologne_phonetics 1.0.0
- checksums.yaml +7 -0
- data/CHANGELOG.md +9 -0
- data/LICENSE.txt +21 -0
- data/README.md +67 -0
- data/lib/cologne_phonetics/rules.rb +53 -0
- data/lib/cologne_phonetics/version.rb +5 -0
- data/lib/cologne_phonetics.rb +58 -0
- metadata +121 -0
checksums.yaml
ADDED
@@ -0,0 +1,7 @@
+---
+SHA1:
+  metadata.gz: 34b708dd44c7f1aea872b47dfa5db5f9d3ad1b91
+  data.tar.gz: 939eaec4a50c1c3e9fff629ee1b52b2f0ca1470d
+SHA512:
+  metadata.gz: '0948c7a91ec2a0764ea8c06d7a8c9a2c9fab2acebdc5cae1cafdf07283dbd8813114513d94b1743e2c7f0c5a85ca416e7e73234d953707ba30bbada050ed5c52'
+  data.tar.gz: 741a913e92789e62e6d1335f941954f915e2a5fab9155801585d661b68426274492de36af677ca2c3fe7b4a25e64a4174e220f8388b873c48099486c186cdea3
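The checksums file pairs each archive inside the gem with a SHA1 and a SHA512 hex digest. As a rough sketch of how such a value could be checked with Ruby's standard `digest` library: the helper name `digest_matches?` is made up for this illustration and is not part of RubyGems or of this gem, and it works on bytes already held in memory.

```ruby
require 'digest'

# Hypothetical helper (an assumption for illustration): compares the hex
# digest of `bytes` with an expected value such as one from checksums.yaml.
def digest_matches?(bytes, expected_hex, algorithm: Digest::SHA512)
  algorithm.hexdigest(bytes) == expected_hex.downcase
end

sample = 'abc'
puts digest_matches?(sample, Digest::SHA1.hexdigest(sample), algorithm: Digest::SHA1)
puts digest_matches?(sample, '0' * 128) # wrong SHA512 value, so this prints false
```

For a real check one would presumably pass `File.binread` of the unpacked `metadata.gz` or `data.tar.gz` together with the matching line from the file above.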
data/CHANGELOG.md
ADDED
@@ -0,0 +1,9 @@
+# Changelog
+
+All notable changes to this project will be documented in this file.
+
+The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/) and this project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.html).
+
+## [1.0.0] – 2018-03-06
+- Initial release.
+
data/LICENSE.txt
ADDED
@@ -0,0 +1,21 @@
+The MIT License (MIT)
+
+Copyright (c) 2017 Stefan Daschek
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in
+all copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+THE SOFTWARE.
data/README.md
ADDED
@@ -0,0 +1,67 @@
+# ColognePhonetics
+
+The [“Cologne phonetics (Kölner Phonetik)”](https://en.wikipedia.org/wiki/Cologne_phonetics) algorithm encodes words in a way that makes it possible to search for similar-sounding words. It’s related to the [“Soundex”](https://en.wikipedia.org/wiki/Soundex) algorithm, but better suited for the German language.
+
+This implementation closely follows the algorithm as described on its Wikipedia page. Support for umlauts (Ä, Ö, Ü) and ß has been added as suggested there.
+
+Note that *other accented characters are not handled*. If your data may contain such characters, you need to preprocess it (for example by using [`I18n.transliterate`](http://www.rubydoc.info/gems/i18n/I18n/Base#transliterate-instance_method)).
+
+## Status
+
+I consider this gem to be stable and (more or less) finished.
+
+## Usage
+
+Example usage:
+
+```ruby
+ColognePhonetics.encode('Wikipedia') # => "3412"
+
+# Only basic characters and äöüß are handled, everything else gets ignored:
+ColognePhonetics.encode('Åè1%-') # => ""
+
+# If a string contains words separated by spaces, each word is encoded separately:
+ColognePhonetics.encode('Heinz Classen') # => "068 4586"
+
+# Use `encode_word` if you want to ignore spaces (note that this usually gives
+# different results than using `encode` and removing spaces afterwards; see
+# Wikipedia article for details):
+ColognePhonetics.encode_word('Heinz Classen') # => "068586"
+```
+
+You can set `ColognePhonetics.debug = true` to get warnings printed to `$stderr` about characters that cannot be encoded:
+
+```ruby
+ColognePhonetics.debug = true
+ColognePhonetics.encode('Olé')
+# Cologne Phonetics: No rule for 'é' (prev: 'l', next: '')
+# => "05"
+```
+
+## Installation
+
+Add this line to your application's Gemfile:
+
+```ruby
+gem 'cologne_phonetics'
+```
+
+And then execute:
+
+    $ bundle
+
+Or install it yourself as:
+
+    $ gem install cologne_phonetics
+
+## Development
+
+After checking out the repo, run `bin/setup` to install dependencies. You can also run `bin/console` for an interactive prompt that will allow you to experiment.
+
+## Contributing
+
+Bug reports and pull requests are welcome on GitHub at https://github.com/noniq/cologne_phonetics. Please make sure to include tests, and check that running `bin/rubocop` does not show any warnings.
+
+## License
+
+The gem is available as open source under the terms of the [MIT License](http://opensource.org/licenses/MIT).
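The README asks callers to preprocess accented characters other than the umlauts and ß, and names `I18n.transliterate` as one option. As an alternative sketch that uses only the Ruby standard library, the hypothetical helper `strip_other_accents` below (not part of the gem) reduces accented letters to their base letter while leaving ä, ö, ü and ß alone so the encoder can still apply its own rules to them. It assumes UTF-8 input.

```ruby
# Characters the README says the encoder handles itself.
KEEP = "\u00E4\u00F6\u00FC\u00C4\u00D6\u00DC\u00DF" # ä ö ü Ä Ö Ü ß

# Hypothetical helper for illustration; the name is an assumption.
def strip_other_accents(string)
  # Compose first so that a decomposed "u" + diaeresis is recognised as "ü".
  string.unicode_normalize(:nfc).each_char.map { |char|
    next char if KEEP.include?(char)
    # NFD splits e.g. "é" into "e" plus a combining mark, which is then dropped.
    char.unicode_normalize(:nfd).gsub(/\p{Mn}/, '')
  }.join
end

puts strip_other_accents("Ol\u00E9") # prints Ole
```

Letters without a canonical decomposition (for example ø or ł) pass through unchanged in this sketch, so it is not a full replacement for a transliteration table.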
data/lib/cologne_phonetics/rules.rb
ADDED
@@ -0,0 +1,53 @@
+# frozen_string_literal: true
+
+module ColognePhonetics
+  # @api private
+  module Rules
+    def self.define(&block)
+      @rules = DSL.new(&block).rules
+    end
+
+    def self.apply_to(string)
+      string = string.downcase.tr('ÄÖÜ', 'äöü') # Ruby < 2.3 downcases ASCII characters only
+      chars = [nil] + string.chars + [nil]
+      chars.each_cons(3).map{ |prev_char, char, next_char|
+        code_for(prev_char, char, next_char)
+      }.join
+    end
+
+    def self.code_for(prev_char, char, next_char)
+      @rules.each do |matcher, code|
+        return code if matcher.call(prev_char, char, next_char)
+      end
+      debug_info "Cologne Phonetics: No rule for '#{char}' (prev: '#{prev_char}', next: '#{next_char}')"
+      nil
+    end
+
+    def self.debug_info(message)
+      return unless ColognePhonetics.debug
+      $stderr.puts message # rubocop:disable StderrPuts
+    end
+
+    class DSL
+      attr_reader :rules
+
+      def initialize(&block)
+        @rules = []
+        instance_exec(&block)
+      end
+
+      def change(chars, to:, before: nil, not_before: nil, after: nil, not_after: nil, initial: nil)
+        matcher = ->(prev_char, char, next_char){
+          return unless chars.include?(char)
+          return if initial && prev_char
+          return if before && (!next_char || !before.include?(next_char))
+          return if not_before && next_char && not_before.include?(next_char)
+          return if after && (!prev_char || !after.include?(prev_char))
+          return if not_after && prev_char && not_after.include?(prev_char)
+          true
+        }
+        @rules << [matcher, to]
+      end
+    end
+  end
+end
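`Rules.apply_to` pads the character list with `nil` on both sides and walks it in windows of three, so every rule can look at the previous and next character. The standalone sketch below isolates that idea. The method names `windows` and `code_for_p` are invented for this illustration, and the single toy rule only covers the letter `p` (`'3'` before `h`, otherwise `'1'`); it is not the gem's rule table.

```ruby
# Pad with nil on both sides so the first and last characters also get a
# (prev, char, next) window, the same shape `apply_to` iterates over.
def windows(word)
  ([nil] + word.downcase.chars + [nil]).each_cons(3).to_a
end

# Toy rule for illustration only: 'p' becomes '3' before 'h', otherwise '1'.
# Any other character yields nil, like an input with no matching rule.
def code_for_p(_prev_char, char, next_char)
  return nil unless char == 'p'
  next_char == 'h' ? '3' : '1'
end

puts windows('ph').inspect
# prints [[nil, "p", "h"], ["p", "h", nil]]

puts windows('Phil').map { |window| code_for_p(*window) }.join
# prints 3 (nil entries join to empty strings)
```

A `nil` neighbour marks a word boundary, which is how the `initial:` option in the DSL can be expressed as "there is no previous character".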
data/lib/cologne_phonetics.rb
ADDED
@@ -0,0 +1,58 @@
+# frozen_string_literal: true
+
+require 'cologne_phonetics/rules'
+require 'cologne_phonetics/version'
+
+module ColognePhonetics
+  class << self
+    # Enable / disable debug mode. If set to true, using {.encode} or {.encode_word} will output
+    # warnings to `$stderr` if they encounter characters that cannot be encoded.
+    attr_accessor :debug
+  end
+
+  # rubocop:disable SpaceBeforeComma
+  Rules.define do
+    change 'aeijouy', to: '0'
+    change 'äöü'    , to: '0' # additional rule: treat umlauts like vowels
+    change 'h'      , to: ''
+    change 'b'      , to: '1'
+    change 'p'      , to: '1', not_before: 'h'
+    change 'dt'     , to: '2', not_before: 'csz'
+    change 'fvw'    , to: '3'
+    change 'p'      , to: '3', before: 'h'
+    change 'gkq'    , to: '4'
+    change 'c'      , to: '4', initial: true, before: 'ahkloqrux'
+    change 'c'      , to: '4', before: 'ahkoqux', not_after: 'sz'
+    change 'x'      , to: '48', not_after: 'ckq'
+    change 'l'      , to: '5'
+    change 'mn'     , to: '6'
+    change 'r'      , to: '7'
+    change 'sz'     , to: '8'
+    change 'ß'      , to: '8' # additional rule: treat 'ß' like 's'
+    change 'c'      , to: '8', after: 'sz'
+    change 'c'      , to: '8', initial: true, not_before: 'ahkloqrux'
+    change 'c'      , to: '8', not_before: 'ahkoqux'
+    change 'dt'     , to: '8', before: 'csz'
+    change 'x'      , to: '8', after: 'ckq'
+  end
+  # rubocop:enable SpaceBeforeComma
+
+  # Encode string using Cologne phonetics rules. The encoding process can handle upper and lower case
+  # characters in the range of `a–z`, as well as `äöüß`. Everything else is ignored.
+  #
+  # If the string consists of several words separated by spaces, each word is encoded separately,
+  # and the resulting codes are then again joined together with spaces.
+  #
+  # @return [String] Encoded string (consists of digits only, and possibly spaces)
+  def self.encode(string)
+    string.split(' ').map{ |word| encode_word(word) }.join(' ')
+  end
+
+  # Low-level method for encoding a single word using Cologne phonetics rules (spaces will be
+  # ignored). You most probably want to use {.encode} instead.
+  #
+  # @return [String] Encoded word (consists of digits only)
+  def self.encode_word(word)
+    Rules.apply_to(word).squeeze.gsub(/(.)0/, '\1')
+  end
+end
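`encode_word` finishes with two clean-up steps on the raw digit string: `squeeze` collapses runs of identical digits, and `gsub(/(.)0/, '\1')` removes every `0` that is not at the start of the code. The sketch below isolates just those two steps. The wrapper name `clean_up` is an assumption for this illustration, and the raw inputs were worked out by hand from the rule table in the diff, not produced by running the gem.

```ruby
# Mirrors the two clean-up steps at the end of `encode_word`; `raw` stands
# for the digit string as it would come out of the per-character rules.
def clean_up(raw)
  raw.squeeze             # collapse runs of the same digit: "4508806" -> "450806"
     .gsub(/(.)0/, '\1')  # drop every "0" that follows another character
end

puts clean_up('304010200') # raw digits for "Wikipedia"; prints 3412
puts clean_up('0068')      # raw digits for "Heinz"; prints 068
puts clean_up('4508806')   # raw digits for "Classen"; prints 4586
```

Because `squeeze` runs first, no `00` pair survives, so the regular expression only ever keeps a `0` when it is the very first character of the code.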
metadata
ADDED
@@ -0,0 +1,121 @@
+--- !ruby/object:Gem::Specification
+name: cologne_phonetics
+version: !ruby/object:Gem::Version
+  version: 1.0.0
+platform: ruby
+authors:
+- Stefan Daschek
+autorequire:
+bindir: exe
+cert_chain: []
+date: 2018-03-06 00:00:00.000000000 Z
+dependencies:
+- !ruby/object:Gem::Dependency
+  name: bundler
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '1.15'
+  type: :development
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '1.15'
+- !ruby/object:Gem::Dependency
+  name: rake
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '10.0'
+  type: :development
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '10.0'
+- !ruby/object:Gem::Dependency
+  name: rspec
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '3.6'
+  type: :development
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '3.6'
+- !ruby/object:Gem::Dependency
+  name: rubocop
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: 0.52.1
+  type: :development
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: 0.52.1
+- !ruby/object:Gem::Dependency
+  name: yard
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: 0.9.12
+  type: :development
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: 0.9.12
+description: Cologne phonetics (also Kölner Phonetik, Cologne process) is a phonetic
+  algorithm which assigns to words a sequence of digits, the phonetic code.
+email:
+- stefan@die-antwort.eu
+executables: []
+extensions: []
+extra_rdoc_files: []
+files:
+- CHANGELOG.md
+- LICENSE.txt
+- README.md
+- lib/cologne_phonetics.rb
+- lib/cologne_phonetics/rules.rb
+- lib/cologne_phonetics/version.rb
+homepage: https://github.com/noniq/cologne_phonetics
+licenses:
+- MIT
+metadata: {}
+post_install_message:
+rdoc_options: []
+require_paths:
+- lib
+required_ruby_version: !ruby/object:Gem::Requirement
+  requirements:
+  - - ">="
+    - !ruby/object:Gem::Version
+      version: '0'
+required_rubygems_version: !ruby/object:Gem::Requirement
+  requirements:
+  - - ">="
+    - !ruby/object:Gem::Version
+      version: '0'
+requirements: []
+rubyforge_project:
+rubygems_version: 2.6.14
+signing_key:
+specification_version: 4
+summary: Cologne phonetics (Kölner Phonetik) text encoding algorithm
+test_files: []
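All development dependencies in the metadata use the pessimistic `~>` operator, while the Ruby and RubyGems requirements are `>= 0`. As a small sketch of what those constraints accept, the snippet below uses `Gem::Requirement` and `Gem::Version` from RubyGems itself. The wrapper name `allowed?` and the candidate version numbers are made up for illustration.

```ruby
require 'rubygems'

# Hypothetical wrapper: does `version` satisfy `constraint`?
def allowed?(constraint, version)
  Gem::Requirement.new(constraint).satisfied_by?(Gem::Version.new(version))
end

puts allowed?('~> 1.15', '1.16.3')   # true: any 1.x at or above 1.15
puts allowed?('~> 1.15', '2.0.0')    # false: next major version is excluded
puts allowed?('~> 0.52.1', '0.52.4') # true: patch releases of 0.52 are fine
puts allowed?('~> 0.52.1', '0.53.0') # false: a three-part "~>" locks the minor
puts allowed?('>= 0', '2.5.0')       # true: effectively no restriction
```

The number of segments in the constraint decides how tight it is, which is why the `rubocop` and `yard` entries are stricter than the `bundler` or `rspec` ones.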