cologne_phonetics 1.0.0
- checksums.yaml +7 -0
- data/CHANGELOG.md +9 -0
- data/LICENSE.txt +21 -0
- data/README.md +67 -0
- data/lib/cologne_phonetics/rules.rb +53 -0
- data/lib/cologne_phonetics/version.rb +5 -0
- data/lib/cologne_phonetics.rb +58 -0
- metadata +121 -0
checksums.yaml
ADDED
@@ -0,0 +1,7 @@
+---
+SHA1:
+  metadata.gz: 34b708dd44c7f1aea872b47dfa5db5f9d3ad1b91
+  data.tar.gz: 939eaec4a50c1c3e9fff629ee1b52b2f0ca1470d
+SHA512:
+  metadata.gz: '0948c7a91ec2a0764ea8c06d7a8c9a2c9fab2acebdc5cae1cafdf07283dbd8813114513d94b1743e2c7f0c5a85ca416e7e73234d953707ba30bbada050ed5c52'
+  data.tar.gz: 741a913e92789e62e6d1335f941954f915e2a5fab9155801585d661b68426274492de36af677ca2c3fe7b4a25e64a4174e220f8388b873c48099486c186cdea3
data/CHANGELOG.md
ADDED
@@ -0,0 +1,9 @@
+# Changelog
+
+All notable changes to this project will be documented in this file.
+
+The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/) and this project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.html).
+
+## [1.0.0] – 2018-03-06
+- Initial release.
+
data/LICENSE.txt
ADDED
@@ -0,0 +1,21 @@
+The MIT License (MIT)
+
+Copyright (c) 2017 Stefan Daschek
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in
+all copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+THE SOFTWARE.
data/README.md
ADDED
@@ -0,0 +1,67 @@
+# ColognePhonetics
+
+The [“Cologne phonetics (Kölner Phonetik)”](https://en.wikipedia.org/wiki/Cologne_phonetics) algorithm encodes words in a way that makes it possible to search for similar-sounding words. It’s related to the [“Soundex”](https://en.wikipedia.org/wiki/Soundex) algorithm, but better suited to the German language.
+
+This implementation closely follows the algorithm as described on its Wikipedia page. Support for umlauts (Ä, Ö, Ü) and ß has been added as suggested there.
+
+Note that *other accented characters are not handled*. If your data may contain such characters, you need to preprocess it (for example by using [`I18n.transliterate`](http://www.rubydoc.info/gems/i18n/I18n/Base#transliterate-instance_method)).
+
+## Status
+
+I consider this gem to be stable and (more or less) finished.
+
+## Usage
+
+Example usage:
+
+```ruby
+ColognePhonetics.encode('Wikipedia') # => "3412"
+
+# Only basic characters and äöüß are handled, everything else gets ignored:
+ColognePhonetics.encode('Åè1%-') # => ""
+
+# If a string contains words separated by spaces, each word is encoded separately:
+ColognePhonetics.encode('Heinz Classen') # => "068 4586"
+
+# Use `encode_word` if you want to ignore spaces (note that this usually gives
+# different results than using `encode` and removing spaces afterwards; see
+# Wikipedia article for details):
+ColognePhonetics.encode_word('Heinz Classen') # => "068586"
+```
+
+You can set `ColognePhonetics.debug = true` to get warnings printed to `$stderr` about characters that cannot be encoded:
+
+```ruby
+ColognePhonetics.debug = true
+ColognePhonetics.encode('Olé')
+# Cologne Phonetics: No rule for 'é' (prev: 'l', next: '')
+# => "05"
+```
+
+## Installation
+
+Add this line to your application's Gemfile:
+
+```ruby
+gem 'cologne_phonetics'
+```
+
+And then execute:
+
+    $ bundle
+
+Or install it yourself as:
+
+    $ gem install cologne_phonetics
+
+## Development
+
+After checking out the repo, run `bin/setup` to install dependencies. You can also run `bin/console` for an interactive prompt that will allow you to experiment.
+
+## Contributing
+
+Bug reports and pull requests are welcome on GitHub at https://github.com/noniq/cologne_phonetics. Please make sure to include tests, and check that running `bin/rubocop` does not show any warnings.
+
+## License
+
+The gem is available as open source under the terms of the [MIT License](http://opensource.org/licenses/MIT).
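The README's note on other accented characters suggests preprocessing with `I18n.transliterate`. Where pulling in the i18n gem is not wanted, a rough standard-library alternative might look like the sketch below. `fold_accents` and its `keep:` parameter are hypothetical names, not part of the gem, and the approach only covers letters that decompose into a base letter plus combining marks (so `ø` or `æ`, for example, pass through unchanged).

```ruby
# Hypothetical helper, not part of cologne_phonetics.
# Characters listed in `keep` already have rules in the gem and are left alone.
def fold_accents(text, keep: 'äöüÄÖÜß')
  # Normalize to NFC first so that "u" + combining diaeresis is seen as one "ü".
  text.unicode_normalize(:nfc).each_char.map { |char|
    next char if keep.include?(char)
    # NFD splits "é" into "e" + U+0301; removing nonspacing marks leaves "e".
    char.unicode_normalize(:nfd).gsub(/\p{Mn}/, '')
  }.join
end

fold_accents('Olé')    # => "Ole"
fold_accents('Müller') # => "Müller"
```

The folded string can then be passed to `ColognePhonetics.encode`, so that `é` contributes a vowel code instead of being silently skipped.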
data/lib/cologne_phonetics/rules.rb
ADDED
@@ -0,0 +1,53 @@
+# frozen_string_literal: true
+
+module ColognePhonetics
+  # @api private
+  module Rules
+    def self.define(&block)
+      @rules = DSL.new(&block).rules
+    end
+
+    def self.apply_to(string)
+      string = string.downcase.tr('ÄÖÜ', 'äöü') # Ruby < 2.4 downcases ASCII characters only
+      chars = [nil] + string.chars + [nil]
+      chars.each_cons(3).map{ |prev_char, char, next_char|
+        code_for(prev_char, char, next_char)
+      }.join
+    end
+
+    def self.code_for(prev_char, char, next_char)
+      @rules.each do |matcher, code|
+        return code if matcher.call(prev_char, char, next_char)
+      end
+      debug_info "Cologne Phonetics: No rule for '#{char}' (prev: '#{prev_char}', next: '#{next_char}')"
+      nil
+    end
+
+    def self.debug_info(message)
+      return unless ColognePhonetics.debug
+      $stderr.puts message # rubocop:disable StderrPuts
+    end
+
+    class DSL
+      attr_reader :rules
+
+      def initialize(&block)
+        @rules = []
+        instance_exec(&block)
+      end
+
+      def change(chars, to:, before: nil, not_before: nil, after: nil, not_after: nil, initial: nil)
+        matcher = ->(prev_char, char, next_char){
+          return unless chars.include?(char)
+          return if initial && prev_char
+          return if before && (!next_char || !before.include?(next_char))
+          return if not_before && next_char && not_before.include?(next_char)
+          return if after && (!prev_char || !after.include?(prev_char))
+          return if not_after && prev_char && not_after.include?(prev_char)
+          true
+        }
+        @rules << [matcher, to]
+      end
+    end
+  end
+end
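`Rules.apply_to` pads the character list with `nil` on both sides so that `each_cons(3)` yields every character together with its neighbours, with `nil` standing in for a word boundary. A minimal illustration of that windowing, using only core Ruby (the variable names are illustrative and do not appear in the gem):

```ruby
chars   = [nil] + 'ach'.chars + [nil]
windows = chars.each_cons(3).to_a
# => [[nil, "a", "c"], ["a", "c", "h"], ["c", "h", nil]]

# The first window has no previous character, which is the condition
# a rule declared with `initial: true` relies on.
first_prev, _first_char, _first_next = windows.first
first_prev.nil? # => true
```

Because each matcher only ever sees one window, a rule can express context such as `before:` or `not_after:` without any look-ahead state of its own.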
data/lib/cologne_phonetics.rb
ADDED
@@ -0,0 +1,58 @@
+# frozen_string_literal: true
+
+require 'cologne_phonetics/rules'
+require 'cologne_phonetics/version'
+
+module ColognePhonetics
+  class << self
+    # Enable / disable debug mode. If set to true, using {.encode} or {.encode_word} will output
+    # warnings to `$stderr` if they encounter characters that cannot be encoded.
+    attr_accessor :debug
+  end
+
+  # rubocop:disable SpaceBeforeComma
+  Rules.define do
+    change 'aeijouy', to: '0'
+    change 'äöü'    , to: '0' # additional rule: treat umlauts like vowels
+    change 'h'      , to: ''
+    change 'b'      , to: '1'
+    change 'p'      , to: '1', not_before: 'h'
+    change 'dt'     , to: '2', not_before: 'csz'
+    change 'fvw'    , to: '3'
+    change 'p'      , to: '3', before: 'h'
+    change 'gkq'    , to: '4'
+    change 'c'      , to: '4', initial: true, before: 'ahkloqrux'
+    change 'c'      , to: '4', before: 'ahkoqux', not_after: 'sz'
+    change 'x'      , to: '48', not_after: 'ckq'
+    change 'l'      , to: '5'
+    change 'mn'     , to: '6'
+    change 'r'      , to: '7'
+    change 'sz'     , to: '8'
+    change 'ß'      , to: '8' # additional rule: treat 'ß' like 's'
+    change 'c'      , to: '8', after: 'sz'
+    change 'c'      , to: '8', initial: true, not_before: 'ahkloqrux'
+    change 'c'      , to: '8', not_before: 'ahkoqux'
+    change 'dt'     , to: '8', before: 'csz'
+    change 'x'      , to: '8', after: 'ckq'
+  end
+  # rubocop:enable SpaceBeforeComma
+
+  # Encode string using Cologne phonetics rules. The encoding process can handle upper and lower case
+  # characters in the range of `a–z`, as well as `äöüß`. Everything else is ignored.
+  #
+  # If the string consists of several words separated by spaces, each word is encoded separately,
+  # and the resulting codes are then again joined together with spaces.
+  #
+  # @return [String] Encoded string (consists of digits only, and maybe spaces)
+  def self.encode(string)
+    string.split(' ').map{ |word| encode_word(word) }.join(' ')
+  end
+
+  # Low-level method for encoding a single word using Cologne phonetics rules (spaces will be
+  # ignored). You most probably want to use {.encode} instead.
+  #
+  # @return [String] Encoded word (consists of digits only)
+  def self.encode_word(word)
+    Rules.apply_to(word).squeeze.gsub(/(.)0/, '\1')
+  end
+end
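`encode_word` post-processes the raw per-character codes in two steps: `squeeze` collapses runs of identical digits, and the `gsub` then drops every `0` that follows another character, so only a leading `0` survives. The sketch below walks through this for `'Wikipedia'`. The raw code strings are derived by hand from the rule table, not produced by running the gem, and the variable names are illustrative.

```ruby
raw      = '304010200' # w=3 i=0 k=4 i=0 p=1 e=0 d=2 i=0 a=0
squeezed = raw.squeeze # => "30401020"
code     = squeezed.gsub(/(.)0/, '\1') # => "3412"

# A leading zero has no preceding character to match `(.)`, so it is kept.
# '0068' is the hand-derived raw code for "Heinz" (h is dropped, e=0 i=0 n=6 z=8).
leading = '0068'.squeeze.gsub(/(.)0/, '\1') # => "068"
```

Both results agree with the values shown in the README (`"3412"` and the first word of `"068 4586"`).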
metadata
ADDED
@@ -0,0 +1,121 @@
+--- !ruby/object:Gem::Specification
+name: cologne_phonetics
+version: !ruby/object:Gem::Version
+  version: 1.0.0
+platform: ruby
+authors:
+- Stefan Daschek
+autorequire:
+bindir: exe
+cert_chain: []
+date: 2018-03-06 00:00:00.000000000 Z
+dependencies:
+- !ruby/object:Gem::Dependency
+  name: bundler
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '1.15'
+  type: :development
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '1.15'
+- !ruby/object:Gem::Dependency
+  name: rake
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '10.0'
+  type: :development
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '10.0'
+- !ruby/object:Gem::Dependency
+  name: rspec
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '3.6'
+  type: :development
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '3.6'
+- !ruby/object:Gem::Dependency
+  name: rubocop
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: 0.52.1
+  type: :development
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: 0.52.1
+- !ruby/object:Gem::Dependency
+  name: yard
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: 0.9.12
+  type: :development
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: 0.9.12
+description: Cologne phonetics (also Kölner Phonetik, Cologne process) is a phonetic
+  algorithm which assigns to words a sequence of digits, the phonetic code.
+email:
+- stefan@die-antwort.eu
+executables: []
+extensions: []
+extra_rdoc_files: []
+files:
+- CHANGELOG.md
+- LICENSE.txt
+- README.md
+- lib/cologne_phonetics.rb
+- lib/cologne_phonetics/rules.rb
+- lib/cologne_phonetics/version.rb
+homepage: https://github.com/noniq/cologne_phonetics
+licenses:
+- MIT
+metadata: {}
+post_install_message:
+rdoc_options: []
+require_paths:
+- lib
+required_ruby_version: !ruby/object:Gem::Requirement
+  requirements:
+  - - ">="
+    - !ruby/object:Gem::Version
+      version: '0'
+required_rubygems_version: !ruby/object:Gem::Requirement
+  requirements:
+  - - ">="
+    - !ruby/object:Gem::Version
+      version: '0'
+requirements: []
+rubyforge_project:
+rubygems_version: 2.6.14
+signing_key:
+specification_version: 4
+summary: Cologne phonetics (Kölner Phonetik) text encoding algorithm
+test_files: []