connectors_utility 8.4.0.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
checksums.yaml ADDED
@@ -0,0 +1,7 @@
+ ---
+ SHA256:
+   metadata.gz: 824ceaf9dec38db287b8ec5739847b5dc08b3c1ea140bce575076bceb8b6caf6
+   data.tar.gz: b287e744ebe57f49162437744649e0e2d304c3e0a2ecfbb1bdeb082e0463f9f8
+ SHA512:
+   metadata.gz: 856965cb7aca1080b86e241e688495c18228dd75d9ee1dc895aed16b3ab86f5a8694c6715bd0e88699803665590c3d4c18d302e3b593ee4fba4a85ccc921c778
+   data.tar.gz: 32398b3b4c4771665dcddd9f493f7ca1155757828cfcb9ddb52ca9f2f2e8eca09d4f8f1a84d6aee108d9af18da2dc5dd2eb1829009bc1bce1896a87b28fb7474
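The digests recorded in checksums.yaml can be compared against the archive members of a downloaded gem. The following is a minimal sketch, not part of the package: `sha256_matches?` is a hypothetical helper name, and the demonstration hashes a throwaway file containing `abc` rather than the real `data.tar.gz`.

```ruby
require 'digest'
require 'tempfile'

# Hypothetical helper (not part of connectors_utility): returns true when the
# SHA256 hex digest of the file at `path` equals `expected_hex`.
def sha256_matches?(path, expected_hex)
  Digest::SHA256.file(path).hexdigest == expected_hex.strip.downcase
end

# Demonstration on a temporary file; the expected value is the well-known
# SHA256 digest of the three bytes "abc".
DEMO_RESULT = Tempfile.create('checksum-demo') do |file|
  file.write('abc')
  file.flush
  sha256_matches?(file.path, 'ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad')
end
```

To check the actual release, one would presumably unpack the `.gem` file (a tar archive) and pass the path of its `data.tar.gz` together with the SHA256 value listed above.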
data/LICENSE ADDED
@@ -0,0 +1,93 @@
+ Elastic License 2.0
+
+ URL: https://www.elastic.co/licensing/elastic-license
+
+ ## Acceptance
+
+ By using the software, you agree to all of the terms and conditions below.
+
+ ## Copyright License
+
+ The licensor grants you a non-exclusive, royalty-free, worldwide,
+ non-sublicensable, non-transferable license to use, copy, distribute, make
+ available, and prepare derivative works of the software, in each case subject to
+ the limitations and conditions below.
+
+ ## Limitations
+
+ You may not provide the software to third parties as a hosted or managed
+ service, where the service provides users with access to any substantial set of
+ the features or functionality of the software.
+
+ You may not move, change, disable, or circumvent the license key functionality
+ in the software, and you may not remove or obscure any functionality in the
+ software that is protected by the license key.
+
+ You may not alter, remove, or obscure any licensing, copyright, or other notices
+ of the licensor in the software. Any use of the licensor’s trademarks is subject
+ to applicable law.
+
+ ## Patents
+
+ The licensor grants you a license, under any patent claims the licensor can
+ license, or becomes able to license, to make, have made, use, sell, offer for
+ sale, import and have imported the software, in each case subject to the
+ limitations and conditions in this license. This license does not cover any
+ patent claims that you cause to be infringed by modifications or additions to
+ the software. If you or your company make any written claim that the software
+ infringes or contributes to infringement of any patent, your patent license for
+ the software granted under these terms ends immediately. If your company makes
+ such a claim, your patent license ends immediately for work on behalf of your
+ company.
+
+ ## Notices
+
+ You must ensure that anyone who gets a copy of any part of the software from you
+ also gets a copy of these terms.
+
+ If you modify the software, you must include in any modified copies of the
+ software prominent notices stating that you have modified the software.
+
+ ## No Other Rights
+
+ These terms do not imply any licenses other than those expressly granted in
+ these terms.
+
+ ## Termination
+
+ If you use the software in violation of these terms, such use is not licensed,
+ and your licenses will automatically terminate. If the licensor provides you
+ with a notice of your violation, and you cease all violation of this license no
+ later than 30 days after you receive that notice, your licenses will be
+ reinstated retroactively. However, if you violate these terms after such
+ reinstatement, any additional violation of these terms will cause your licenses
+ to terminate automatically and permanently.
+
+ ## No Liability
+
+ *As far as the law allows, the software comes as is, without any warranty or
+ condition, and the licensor will not be liable to you for any damages arising
+ out of these terms or the use or nature of the software, under any kind of
+ legal claim.*
+
+ ## Definitions
+
+ The **licensor** is the entity offering these terms, and the **software** is the
+ software the licensor makes available under these terms, including any portion
+ of it.
+
+ **you** refers to the individual or entity agreeing to these terms.
+
+ **your company** is any legal entity, sole proprietorship, or other kind of
+ organization that you work for, plus all organizations that have control over,
+ are under the control of, or are under common control with that
+ organization. **control** means ownership of substantially all the assets of an
+ entity, or the power to direct its management and policies by vote, contract, or
+ otherwise. Control can be direct or indirect.
+
+ **your licenses** are all the licenses granted to you for the software under
+ these terms.
+
+ **use** means anything you do with the software requiring one of your licenses.
+
+ **trademark** means trademarks, service marks, and similar rights.
data/NOTICE.txt ADDED
@@ -0,0 +1,2 @@
+ connectors
+ Copyright 2022 Elasticsearch B.V.
data/lib/connectors_utility.rb ADDED
@@ -0,0 +1,10 @@
+ #
+ # Copyright Elasticsearch B.V. and/or licensed to Elasticsearch B.V. under one
+ # or more contributor license agreements. Licensed under the Elastic License;
+ # you may not use this file except in compliance with the Elastic License.
+ #
+
+ # frozen_string_literal: true
+
+ require_relative 'utility/elasticsearch/index/text_analysis_settings'
+ require_relative 'utility/elasticsearch/index/mappings'
data/lib/utility/elasticsearch/index/language_data.yml ADDED
@@ -0,0 +1,111 @@
+ ---
+ da:
+   name: Danish
+   stemmer: danish
+   stop_words: _danish_
+ de:
+   name: German
+   stemmer: light_german
+   stop_words: _german_
+ en:
+   name: English
+   stemmer: light_english
+   stop_words: _english_
+ es:
+   name: Spanish
+   stemmer: light_spanish
+   stop_words: _spanish_
+ fr:
+   name: French
+   stemmer: light_french
+   stop_words: _french_
+   custom_filter_definitions:
+     fr-elision:
+       type: elision
+       articles:
+         - l
+         - m
+         - t
+         - qu
+         - n
+         - s
+         - j
+         - d
+         - c
+         - jusqu
+         - quoiqu
+         - lorsqu
+         - puisqu
+       articles_case: true
+   prepended_filters:
+     - fr-elision
+ it:
+   name: Italian
+   stemmer: light_italian
+   stop_words: _italian_
+   custom_filter_definitions:
+     it-elision:
+       type: elision
+       articles:
+         - c
+         - l
+         - all
+         - dall
+         - dell
+         - nell
+         - sull
+         - coll
+         - pell
+         - gl
+         - agl
+         - dagl
+         - degl
+         - negl
+         - sugl
+         - un
+         - m
+         - t
+         - s
+         - v
+         - d
+       articles_case: true
+   prepended_filters:
+     - it-elision
+ ja:
+   name: Japanese
+   stemmer: light_english
+   stop_words: _english_
+   postpended_filters:
+     - cjk_bigram
+ ko:
+   name: Korean
+   stemmer: light_english
+   stop_words: _english_
+   postpended_filters:
+     - cjk_bigram
+ nl:
+   name: Dutch
+   stemmer: dutch
+   stop_words: _dutch_
+ pt:
+   name: Portuguese
+   stemmer: light_portuguese
+   stop_words: _portuguese_
+ pt-br:
+   name: Portuguese (Brazil)
+   stemmer: brazilian
+   stop_words: _brazilian_
+ ru:
+   name: Russian
+   stemmer: russian
+   stop_words: _russian_
+ th:
+   name: Thai
+   stemmer: light_english
+   stop_words: _thai_
+ zh:
+   name: Chinese
+   stemmer: light_english
+   stop_words: _english_
+   postpended_filters:
+     - cjk_bigram
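Each language entry above carries optional `prepended_filters` and `postpended_filters` lists that wrap the shared filter chain. The sketch below illustrates that ordering on a two-language excerpt. It is an approximation under stated assumptions: `stem_filter_chain` and `LANGUAGE_DATA_EXCERPT` are hypothetical names, filter names are emitted as strings rather than symbols, and the folding filters are passed in by the caller.

```ruby
require 'yaml'

# Excerpt of the language data, reduced to the keys this sketch needs.
LANGUAGE_DATA_EXCERPT = <<~YAML
  fr:
    name: French
    stemmer: light_french
    stop_words: _french_
    prepended_filters:
      - fr-elision
  ja:
    name: Japanese
    stemmer: light_english
    stop_words: _english_
    postpended_filters:
      - cjk_bigram
YAML

# Hypothetical helper: builds the filter list for a stemming analyzer by
# placing language-specific filters before and after the shared chain.
def stem_filter_chain(language_data, code, folding_filters)
  entry = language_data.fetch(code)
  [
    *(entry[:prepended_filters] || []),
    *folding_filters,
    "#{code}-stop-words-filter",
    "#{code}-stem-filter",
    *(entry[:postpended_filters] || [])
  ]
end

# symbolize_names: true turns the YAML keys into symbols such as :fr and :ja.
EXCERPT = YAML.safe_load(LANGUAGE_DATA_EXCERPT, symbolize_names: true)
FR_CHAIN = stem_filter_chain(EXCERPT, :fr, %w[cjk_width lowercase asciifolding])
JA_CHAIN = stem_filter_chain(EXCERPT, :ja, %w[icu_folding])
```

Languages without either list, such as `da` or `de`, would simply get the shared chain with nothing added on either side.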
data/lib/utility/elasticsearch/index/mappings.rb ADDED
@@ -0,0 +1,78 @@
+ #
+ # Copyright Elasticsearch B.V. and/or licensed to Elasticsearch B.V. under one
+ # or more contributor license agreements. Licensed under the Elastic License;
+ # you may not use this file except in compliance with the Elastic License.
+ #
+
+ # frozen_string_literal: true
+
+ module Utility
+   module Elasticsearch
+     module Index
+       module Mappings
+         ENUM_IGNORE_ABOVE = 2048
+
+         WORKPLACE_SEARCH_SUBEXTRACTION_STAMP_FIELD_MAPPINGS = {
+           _subextracted_as_of: {
+             type: 'date'
+           },
+           _subextracted_version: {
+             type: 'keyword'
+           }
+         }.freeze
+
+         def self.default_text_fields_mappings(connectors_index:)
+           {
+             dynamic: true,
+             dynamic_templates: [
+               {
+                 data: {
+                   match_mapping_type: 'string',
+                   mapping: {
+                     type: 'text',
+                     analyzer: 'iq_text_base',
+                     index_options: 'freqs',
+                     fields: {
+                       'stem': {
+                         type: 'text',
+                         analyzer: 'iq_text_stem'
+                       },
+                       'prefix' => {
+                         type: 'text',
+                         analyzer: 'i_prefix',
+                         search_analyzer: 'q_prefix',
+                         index_options: 'docs'
+                       },
+                       'delimiter' => {
+                         type: 'text',
+                         analyzer: 'iq_text_delimiter',
+                         index_options: 'freqs'
+                       },
+                       'joined': {
+                         type: 'text',
+                         analyzer: 'i_text_bigram',
+                         search_analyzer: 'q_text_bigram',
+                         index_options: 'freqs'
+                       },
+                       'enum': {
+                         type: 'keyword',
+                         ignore_above: ENUM_IGNORE_ABOVE
+                       }
+                     }
+                   }
+                 }
+               }
+             ],
+             properties: {
+               id: {
+                 type: 'keyword'
+               }
+             }.tap do |properties|
+               properties.merge!(WORKPLACE_SEARCH_SUBEXTRACTION_STAMP_FIELD_MAPPINGS) if connectors_index
+             end
+           }
+         end
+       end
+     end
+   end
+ end
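Two details of the mappings hunk are easy to miss: the `properties` hash is extended through `tap` only when `connectors_index` is truthy, and the `fields` hash mixes symbol keys (`'stem':`) with string keys (`'prefix' =>`). The following reduced sketch shows both behaviours in isolation. `STAMP_FIELDS` and `properties_for` are hypothetical names, and the JSON step assumes the mapping is eventually serialized to JSON before being sent to Elasticsearch, which the diff itself does not show.

```ruby
require 'json'

# Stand-in for WORKPLACE_SEARCH_SUBEXTRACTION_STAMP_FIELD_MAPPINGS.
STAMP_FIELDS = {
  _subextracted_as_of: { type: 'date' },
  _subextracted_version: { type: 'keyword' }
}.freeze

# Hypothetical reduction of the `properties` part of the mapping: `tap` yields
# the fresh hash, which is extended in place only for connector indices.
def properties_for(connectors_index:)
  { id: { type: 'keyword' } }.tap do |properties|
    properties.merge!(STAMP_FIELDS) if connectors_index
  end
end

PLAIN_KEYS = properties_for(connectors_index: false).keys
CONNECTOR_KEYS = properties_for(connectors_index: true).keys

# Symbol and string keys differ inside Ruby but serialize to identical JSON.
MIXED_KEYS_JSON = JSON.generate({ 'stem': 1, 'prefix' => 2 })
```

Because `merge!` runs on the new hash literal rather than on the frozen constant, repeated calls do not leak the stamp fields into later results.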
data/lib/utility/elasticsearch/index/text_analysis_settings.rb ADDED
@@ -0,0 +1,226 @@
+ #
+ # Copyright Elasticsearch B.V. and/or licensed to Elasticsearch B.V. under one
+ # or more contributor license agreements. Licensed under the Elastic License;
+ # you may not use this file except in compliance with the Elastic License.
+ #
+
+ # frozen_string_literal: true
+
+ require 'yaml'
+
+ module Utility
+   module Elasticsearch
+     module Index
+       class TextAnalysisSettings
+         class UnsupportedLanguageCode < StandardError; end
+
+         DEFAULT_LANGUAGE = :en
+         FRONT_NGRAM_MAX_GRAM = 12
+         LANGUAGE_DATA_FILE_PATH = File.join(File.dirname(__FILE__), 'language_data.yml')
+
+         GENERIC_FILTERS = {
+           front_ngram: {
+             type: 'edge_ngram',
+             min_gram: 1,
+             max_gram: FRONT_NGRAM_MAX_GRAM
+           },
+           delimiter: {
+             type: 'word_delimiter_graph',
+             generate_word_parts: true,
+             generate_number_parts: true,
+             catenate_words: true,
+             catenate_numbers: true,
+             catenate_all: true,
+             preserve_original: false,
+             split_on_case_change: true,
+             split_on_numerics: true,
+             stem_english_possessive: true
+           },
+           bigram_joiner: {
+             type: 'shingle',
+             token_separator: '',
+             max_shingle_size: 2,
+             output_unigrams: false
+           },
+           bigram_joiner_unigrams: {
+             type: 'shingle',
+             token_separator: '',
+             max_shingle_size: 2,
+             output_unigrams: true
+           },
+           bigram_max_size: {
+             type: 'length',
+             min: 0,
+             max: 16
+           }
+         }.freeze
+
+         NON_ICU_ANALYSIS_SETTINGS = {
+           tokenizer_name: 'standard', folding_filters: %w(cjk_width lowercase asciifolding)
+         }.freeze
+
+         ICU_ANALYSIS_SETTINGS = {
+           tokenizer_name: 'icu_tokenizer', folding_filters: %w(icu_folding)
+         }.freeze
+
+         def initialize(language_code: nil, analysis_icu: false)
+           @language_code = (language_code || DEFAULT_LANGUAGE).to_sym
+
+           raise UnsupportedLanguageCode, "Language '#{language_code}' is not supported" unless language_data[@language_code]
+
+           @analysis_icu = analysis_icu
+           @analysis_settings = icu_settings(analysis_icu)
+         end
+
+         def to_h
+           {
+             analysis: {
+               analyzer: analyzer_definitions,
+               filter: filter_definitions
+             },
+             index: {
+               similarity: {
+                 default: {
+                   type: 'BM25'
+                 }
+               }
+             }
+           }
+         end
+
+         private
+
+         attr_reader :language_code, :analysis_settings
+
+         def icu_settings(analysis_settings)
+           return ICU_ANALYSIS_SETTINGS if analysis_settings
+
+           NON_ICU_ANALYSIS_SETTINGS
+         end
+
+         def stemmer_name
+           language_data[language_code][:stemmer]
+         end
+
+         def stop_words_name_or_list
+           language_data[language_code][:stop_words]
+         end
+
+         def custom_filter_definitions
+           language_data[language_code][:custom_filter_definitions] || {}
+         end
+
+         def prepended_filters
+           language_data[language_code][:prepended_filters] || []
+         end
+
+         def postpended_filters
+           language_data[language_code][:postpended_filters] || []
+         end
+
+         def stem_filter_name
+           "#{language_code}-stem-filter".to_sym
+         end
+
+         def stop_words_filter_name
+           "#{language_code}-stop-words-filter".to_sym
+         end
+
+         def filter_definitions
+           definitions = GENERIC_FILTERS.dup
+
+           definitions[stem_filter_name] = {
+             type: 'stemmer',
+             name: stemmer_name
+           }
+
+           definitions[stop_words_filter_name] = {
+             type: 'stop',
+             stopwords: stop_words_name_or_list
+           }
+
+           definitions.merge(custom_filter_definitions)
+         end
+
+         def analyzer_definitions
+           definitions = {}
+
+           definitions[:i_prefix] = {
+             tokenizer: analysis_settings[:tokenizer_name],
+             filter: [
+               *analysis_settings[:folding_filters],
+               'front_ngram'
+             ]
+           }
+
+           definitions[:q_prefix] = {
+             tokenizer: analysis_settings[:tokenizer_name],
+             filter: [
+               *analysis_settings[:folding_filters]
+             ]
+           }
+
+           definitions[:iq_text_base] = {
+             tokenizer: analysis_settings[:tokenizer_name],
+             filter: [
+               *analysis_settings[:folding_filters],
+               stop_words_filter_name
+             ]
+           }
+
+           definitions[:iq_text_stem] = {
+             tokenizer: analysis_settings[:tokenizer_name],
+             filter: [
+               *prepended_filters,
+               *analysis_settings[:folding_filters],
+               stop_words_filter_name,
+               stem_filter_name,
+               *postpended_filters
+             ]
+           }
+
+           definitions[:iq_text_delimiter] = {
+             tokenizer: 'whitespace',
+             filter: [
+               *prepended_filters,
+               'delimiter',
+               *analysis_settings[:folding_filters],
+               stop_words_filter_name,
+               stem_filter_name,
+               *postpended_filters
+             ]
+           }
+
+           definitions[:i_text_bigram] = {
+             tokenizer: analysis_settings[:tokenizer_name],
+             filter: [
+               *analysis_settings[:folding_filters],
+               stem_filter_name,
+               'bigram_joiner',
+               'bigram_max_size'
+             ]
+           }
+
+           definitions[:q_text_bigram] = {
+             tokenizer: analysis_settings[:tokenizer_name],
+             filter: [
+               *analysis_settings[:folding_filters],
+               stem_filter_name,
+               'bigram_joiner_unigrams',
+               'bigram_max_size'
+             ]
+           }
+
+           definitions
+         end
+
+         def language_data
+           @language_data ||= YAML.safe_load(
+             File.read(LANGUAGE_DATA_FILE_PATH),
+             symbolize_names: true
+           )
+         end
+       end
+     end
+   end
+ end
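The `analysis_icu` flag in the class above switches both the tokenizer and the folding filters, and the index-time and query-time prefix analyzers differ only by the trailing `front_ngram` filter. The sketch below reproduces just that selection outside the class. `NON_ICU`, `ICU`, and `prefix_analyzers` are hypothetical names chosen for the illustration; the setting values are copied from the hunk.

```ruby
# Settings copied from NON_ICU_ANALYSIS_SETTINGS and ICU_ANALYSIS_SETTINGS.
NON_ICU = { tokenizer_name: 'standard', folding_filters: %w[cjk_width lowercase asciifolding] }.freeze
ICU = { tokenizer_name: 'icu_tokenizer', folding_filters: %w[icu_folding] }.freeze

# Hypothetical helper: builds only the two prefix analyzers. The index-time
# analyzer (i_prefix) appends edge n-grams; the query-time one (q_prefix) does not.
def prefix_analyzers(analysis_icu: false)
  settings = analysis_icu ? ICU : NON_ICU
  {
    i_prefix: {
      tokenizer: settings[:tokenizer_name],
      filter: [*settings[:folding_filters], 'front_ngram']
    },
    q_prefix: {
      tokenizer: settings[:tokenizer_name],
      filter: [*settings[:folding_filters]]
    }
  }
end

DEFAULT_PREFIX = prefix_analyzers
ICU_PREFIX = prefix_analyzers(analysis_icu: true)
```

Keeping `front_ngram` out of the query-time analyzer means a typed prefix is matched as a whole against the indexed edge n-grams instead of being expanded into n-grams itself.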
metadata ADDED
@@ -0,0 +1,50 @@
+ --- !ruby/object:Gem::Specification
+ name: connectors_utility
+ version: !ruby/object:Gem::Version
+   version: 8.4.0.0
+ platform: ruby
+ authors:
+ - Elastic
+ autorequire:
+ bindir: bin
+ cert_chain: []
+ date: 2022-07-14 00:00:00.000000000 Z
+ dependencies: []
+ description: ''
+ email: ent-search-dev@elastic.co
+ executables: []
+ extensions: []
+ extra_rdoc_files: []
+ files:
+ - LICENSE
+ - NOTICE.txt
+ - lib/connectors_utility.rb
+ - lib/utility/elasticsearch/index/language_data.yml
+ - lib/utility/elasticsearch/index/mappings.rb
+ - lib/utility/elasticsearch/index/text_analysis_settings.rb
+ homepage: https://github.com/elastic/connectors-ruby
+ licenses:
+ - Elastic-2.0
+ metadata:
+   revision: c9283d0e12a3ae8253225becbefef02d0c6153c8
+   repository: git@github.com:elastic/connectors.git
+ post_install_message:
+ rdoc_options: []
+ require_paths:
+ - lib
+ required_ruby_version: !ruby/object:Gem::Requirement
+   requirements:
+   - - ">="
+     - !ruby/object:Gem::Version
+       version: '0'
+ required_rubygems_version: !ruby/object:Gem::Requirement
+   requirements:
+   - - ">="
+     - !ruby/object:Gem::Version
+       version: '0'
+ requirements: []
+ rubygems_version: 3.0.3.1
+ signing_key:
+ specification_version: 4
+ summary: Gem containing shared Connector Services libraries
+ test_files: []