ebm4subjects 0.4.1__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -0,0 +1,20 @@
+ # Python-generated files
+ __pycache__/
+ *.py[oc]
+ build/
+ dist/
+ wheels/
+ *.egg-info
+
+ # Virtual environments
+ .venv
+
+ # uv lock file
+ uv.lock
+
+ # Continue helper files
+ .continue/**
+
+ # Test output and local test script
+ test_out/*
+ tests/run_test.py
@@ -0,0 +1 @@
+ 3.11
@@ -0,0 +1,287 @@
+ EUROPEAN UNION PUBLIC LICENCE v. 1.2
+ EUPL © the European Union 2007, 2016
+
+ This European Union Public Licence (the ‘EUPL’) applies to the Work (as defined
+ below) which is provided under the terms of this Licence. Any use of the Work,
+ other than as authorised under this Licence is prohibited (to the extent such
+ use is covered by a right of the copyright holder of the Work).
+
+ The Work is provided under the terms of this Licence when the Licensor (as
+ defined below) has placed the following notice immediately following the
+ copyright notice for the Work:
+
+ Licensed under the EUPL
+
+ or has expressed by any other means his willingness to license under the EUPL.
+
+ 1. Definitions
+
+ In this Licence, the following terms have the following meaning:
+
+ - ‘The Licence’: this Licence.
+
+ - ‘The Original Work’: the work or software distributed or communicated by the
+ Licensor under this Licence, available as Source Code and also as Executable
+ Code as the case may be.
+
+ - ‘Derivative Works’: the works or software that could be created by the
+ Licensee, based upon the Original Work or modifications thereof. This Licence
+ does not define the extent of modification or dependence on the Original Work
+ required in order to classify a work as a Derivative Work; this extent is
+ determined by copyright law applicable in the country mentioned in Article 15.
+
+ - ‘The Work’: the Original Work or its Derivative Works.
+
+ - ‘The Source Code’: the human-readable form of the Work which is the most
+ convenient for people to study and modify.
+
+ - ‘The Executable Code’: any code which has generally been compiled and which is
+ meant to be interpreted by a computer as a program.
+
+ - ‘The Licensor’: the natural or legal person that distributes or communicates
+ the Work under the Licence.
+
+ - ‘Contributor(s)’: any natural or legal person who modifies the Work under the
+ Licence, or otherwise contributes to the creation of a Derivative Work.
+
+ - ‘The Licensee’ or ‘You’: any natural or legal person who makes any usage of
+ the Work under the terms of the Licence.
+
+ - ‘Distribution’ or ‘Communication’: any act of selling, giving, lending,
+ renting, distributing, communicating, transmitting, or otherwise making
+ available, online or offline, copies of the Work or providing access to its
+ essential functionalities at the disposal of any other natural or legal
+ person.
+
+ 2. Scope of the rights granted by the Licence
+
+ The Licensor hereby grants You a worldwide, royalty-free, non-exclusive,
+ sublicensable licence to do the following, for the duration of copyright vested
+ in the Original Work:
+
+ - use the Work in any circumstance and for all usage,
+ - reproduce the Work,
+ - modify the Work, and make Derivative Works based upon the Work,
+ - communicate to the public, including the right to make available or display
+ the Work or copies thereof to the public and perform publicly, as the case may
+ be, the Work,
+ - distribute the Work or copies thereof,
+ - lend and rent the Work or copies thereof,
+ - sublicense rights in the Work or copies thereof.
+
+ Those rights can be exercised on any media, supports and formats, whether now
+ known or later invented, as far as the applicable law permits so.
+
+ In the countries where moral rights apply, the Licensor waives his right to
+ exercise his moral right to the extent allowed by law in order to make effective
+ the licence of the economic rights here above listed.
+
+ The Licensor grants to the Licensee royalty-free, non-exclusive usage rights to
+ any patents held by the Licensor, to the extent necessary to make use of the
+ rights granted on the Work under this Licence.
+
+ 3. Communication of the Source Code
+
+ The Licensor may provide the Work either in its Source Code form, or as
+ Executable Code. If the Work is provided as Executable Code, the Licensor
+ provides in addition a machine-readable copy of the Source Code of the Work
+ along with each copy of the Work that the Licensor distributes or indicates, in
+ a notice following the copyright notice attached to the Work, a repository where
+ the Source Code is easily and freely accessible for as long as the Licensor
+ continues to distribute or communicate the Work.
+
+ 4. Limitations on copyright
+
+ Nothing in this Licence is intended to deprive the Licensee of the benefits from
+ any exception or limitation to the exclusive rights of the rights owners in the
+ Work, of the exhaustion of those rights or of other applicable limitations
+ thereto.
+
+ 5. Obligations of the Licensee
+
+ The grant of the rights mentioned above is subject to some restrictions and
+ obligations imposed on the Licensee. Those obligations are the following:
+
+ Attribution right: The Licensee shall keep intact all copyright, patent or
+ trademarks notices and all notices that refer to the Licence and to the
+ disclaimer of warranties. The Licensee must include a copy of such notices and a
+ copy of the Licence with every copy of the Work he/she distributes or
+ communicates. The Licensee must cause any Derivative Work to carry prominent
+ notices stating that the Work has been modified and the date of modification.
+
+ Copyleft clause: If the Licensee distributes or communicates copies of the
+ Original Works or Derivative Works, this Distribution or Communication will be
+ done under the terms of this Licence or of a later version of this Licence
+ unless the Original Work is expressly distributed only under this version of the
+ Licence — for example by communicating ‘EUPL v. 1.2 only’. The Licensee
+ (becoming Licensor) cannot offer or impose any additional terms or conditions on
+ the Work or Derivative Work that alter or restrict the terms of the Licence.
+
+ Compatibility clause: If the Licensee Distributes or Communicates Derivative
+ Works or copies thereof based upon both the Work and another work licensed under
+ a Compatible Licence, this Distribution or Communication can be done under the
+ terms of this Compatible Licence. For the sake of this clause, ‘Compatible
+ Licence’ refers to the licences listed in the appendix attached to this Licence.
+ Should the Licensee's obligations under the Compatible Licence conflict with
+ his/her obligations under this Licence, the obligations of the Compatible
+ Licence shall prevail.
+
+ Provision of Source Code: When distributing or communicating copies of the Work,
+ the Licensee will provide a machine-readable copy of the Source Code or indicate
+ a repository where this Source will be easily and freely available for as long
+ as the Licensee continues to distribute or communicate the Work.
+
+ Legal Protection: This Licence does not grant permission to use the trade names,
+ trademarks, service marks, or names of the Licensor, except as required for
+ reasonable and customary use in describing the origin of the Work and
+ reproducing the content of the copyright notice.
+
+ 6. Chain of Authorship
+
+ The original Licensor warrants that the copyright in the Original Work granted
+ hereunder is owned by him/her or licensed to him/her and that he/she has the
+ power and authority to grant the Licence.
+
+ Each Contributor warrants that the copyright in the modifications he/she brings
+ to the Work are owned by him/her or licensed to him/her and that he/she has the
+ power and authority to grant the Licence.
+
+ Each time You accept the Licence, the original Licensor and subsequent
+ Contributors grant You a licence to their contributions to the Work, under the
+ terms of this Licence.
+
+ 7. Disclaimer of Warranty
+
+ The Work is a work in progress, which is continuously improved by numerous
+ Contributors. It is not a finished work and may therefore contain defects or
+ ‘bugs’ inherent to this type of development.
+
+ For the above reason, the Work is provided under the Licence on an ‘as is’ basis
+ and without warranties of any kind concerning the Work, including without
+ limitation merchantability, fitness for a particular purpose, absence of defects
+ or errors, accuracy, non-infringement of intellectual property rights other than
+ copyright as stated in Article 6 of this Licence.
+
+ This disclaimer of warranty is an essential part of the Licence and a condition
+ for the grant of any rights to the Work.
+
+ 8. Disclaimer of Liability
+
+ Except in the cases of wilful misconduct or damages directly caused to natural
+ persons, the Licensor will in no event be liable for any direct or indirect,
+ material or moral, damages of any kind, arising out of the Licence or of the use
+ of the Work, including without limitation, damages for loss of goodwill, work
+ stoppage, computer failure or malfunction, loss of data or any commercial
+ damage, even if the Licensor has been advised of the possibility of such damage.
+ However, the Licensor will be liable under statutory product liability laws as
+ far such laws apply to the Work.
+
+ 9. Additional agreements
+
+ While distributing the Work, You may choose to conclude an additional agreement,
+ defining obligations or services consistent with this Licence. However, if
+ accepting obligations, You may act only on your own behalf and on your sole
+ responsibility, not on behalf of the original Licensor or any other Contributor,
+ and only if You agree to indemnify, defend, and hold each Contributor harmless
+ for any liability incurred by, or claims asserted against such Contributor by
+ the fact You have accepted any warranty or additional liability.
+
+ 10. Acceptance of the Licence
+
+ The provisions of this Licence can be accepted by clicking on an icon ‘I agree’
+ placed under the bottom of a window displaying the text of this Licence or by
+ affirming consent in any other similar way, in accordance with the rules of
+ applicable law. Clicking on that icon indicates your clear and irrevocable
+ acceptance of this Licence and all of its terms and conditions.
+
+ Similarly, you irrevocably accept this Licence and all of its terms and
+ conditions by exercising any rights granted to You by Article 2 of this Licence,
+ such as the use of the Work, the creation by You of a Derivative Work or the
+ Distribution or Communication by You of the Work or copies thereof.
+
+ 11. Information to the public
+
+ In case of any Distribution or Communication of the Work by means of electronic
+ communication by You (for example, by offering to download the Work from a
+ remote location) the distribution channel or media (for example, a website) must
+ at least provide to the public the information requested by the applicable law
+ regarding the Licensor, the Licence and the way it may be accessible, concluded,
+ stored and reproduced by the Licensee.
+
+ 12. Termination of the Licence
+
+ The Licence and the rights granted hereunder will terminate automatically upon
+ any breach by the Licensee of the terms of the Licence.
+
+ Such a termination will not terminate the licences of any person who has
+ received the Work from the Licensee under the Licence, provided such persons
+ remain in full compliance with the Licence.
+
+ 13. Miscellaneous
+
+ Without prejudice of Article 9 above, the Licence represents the complete
+ agreement between the Parties as to the Work.
+
+ If any provision of the Licence is invalid or unenforceable under applicable
+ law, this will not affect the validity or enforceability of the Licence as a
+ whole. Such provision will be construed or reformed so as necessary to make it
+ valid and enforceable.
+
+ The European Commission may publish other linguistic versions or new versions of
+ this Licence or updated versions of the Appendix, so far this is required and
+ reasonable, without reducing the scope of the rights granted by the Licence. New
+ versions of the Licence will be published with a unique version number.
+
+ All linguistic versions of this Licence, approved by the European Commission,
+ have identical value. Parties can take advantage of the linguistic version of
+ their choice.
+
+ 14. Jurisdiction
+
+ Without prejudice to specific agreement between parties,
+
+ - any litigation resulting from the interpretation of this License, arising
+ between the European Union institutions, bodies, offices or agencies, as a
+ Licensor, and any Licensee, will be subject to the jurisdiction of the Court
+ of Justice of the European Union, as laid down in article 272 of the Treaty on
+ the Functioning of the European Union,
+
+ - any litigation arising between other parties and resulting from the
+ interpretation of this License, will be subject to the exclusive jurisdiction
+ of the competent court where the Licensor resides or conducts its primary
+ business.
+
+ 15. Applicable Law
+
+ Without prejudice to specific agreement between parties,
+
+ - this Licence shall be governed by the law of the European Union Member State
+ where the Licensor has his seat, resides or has his registered office,
+
+ - this licence shall be governed by Belgian law if the Licensor has no seat,
+ residence or registered office inside a European Union Member State.
+
+ Appendix
+
+ ‘Compatible Licences’ according to Article 5 EUPL are:
+
+ - GNU General Public License (GPL) v. 2, v. 3
+ - GNU Affero General Public License (AGPL) v. 3
+ - Open Software License (OSL) v. 2.1, v. 3.0
+ - Eclipse Public License (EPL) v. 1.0
+ - CeCILL v. 2.0, v. 2.1
+ - Mozilla Public Licence (MPL) v. 2
+ - GNU Lesser General Public Licence (LGPL) v. 2.1, v. 3
+ - Creative Commons Attribution-ShareAlike v. 3.0 Unported (CC BY-SA 3.0) for
+ works other than software
+ - European Union Public Licence (EUPL) v. 1.1, v. 1.2
+ - Québec Free and Open-Source Licence — Reciprocity (LiLiQ-R) or Strong
+ Reciprocity (LiLiQ-R+).
+
+ The European Commission may update this Appendix to later versions of the above
+ licences without producing a new version of the EUPL, as long as they provide
+ the rights granted in Article 2 of this Licence and protect the covered Source
+ Code from exclusive appropriation.
+
+ All other changes or additions to this Appendix require the production of a new
+ EUPL version.
@@ -0,0 +1,134 @@
+ Metadata-Version: 2.4
+ Name: ebm4subjects
+ Version: 0.4.1
+ Summary: Embedding Based Matching for Automated Subject Indexing
+ Author: Deutsche Nationalbibliothek
+ Maintainer-email: Clemens Rietdorf <c.rietdorf@dnb.de>, Maximilian Kähler <m.kaehler@dnb.de>
+ License-Expression: EUPL-1.2
+ License-File: LICENSE
+ Keywords: code4lib,machine-learning,multilabel-classification,subject-indexing,text-classification
+ Classifier: Development Status :: 3 - Alpha
+ Classifier: License :: OSI Approved :: European Union Public Licence 1.2 (EUPL 1.2)
+ Classifier: Operating System :: OS Independent
+ Classifier: Programming Language :: Python :: 3
+ Requires-Python: >=3.10
+ Requires-Dist: duckdb>=1.3.0
+ Requires-Dist: flash-attn>=2.8.2
+ Requires-Dist: nltk~=3.9.1
+ Requires-Dist: pandas>=2.3.0
+ Requires-Dist: polars>=1.30.0
+ Requires-Dist: pyarrow>=21.0.0
+ Requires-Dist: pyoxigraph>=0.4.11
+ Requires-Dist: rdflib~=7.1.3
+ Requires-Dist: sentence-transformers>=5.0.0
+ Requires-Dist: xgboost>=3.0.2
+ Description-Content-Type: text/markdown
+
+ # Embedding Based Matching for Automated Subject Indexing
+
+ **NOTE: Work in progress. This repository is still under construction.**
+
+ This repository implements an algorithm for matching subjects with
+ sentence transformer embeddings. While all functionality of this code
+ can be run independently, this repository is not intended as
+ standalone software, but is designed to work as a backend for the
+ [Annif toolkit](https://annif.org/).
+
+ The idea of embedding based matching (EBM) is an inverted retrieval logic:
+ the target vocabulary is vectorized with a sentence transformer model and
+ the embeddings are stored in a vector store, enabling fast search across
+ them with the Hierarchical Navigable Small World (HNSW) algorithm.
+ This enables fast semantic (embedding based) search across the
+ vocabulary, even for extremely large vocabularies with many synonyms.
+
+ An input text to be indexed with terms from this vocabulary is embedded with the same
+ sentence transformer model and sent as a query to the vector store, yielding
+ subject candidates whose embeddings are close to the query.
+ Longer input texts can be chunked, resulting in multiple queries.
+
+ Finally, a ranker model is trained that reranks the subject candidates using
+ numerical features collected during the matching process.
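The matching step described above can be sketched without any external libraries. The three-dimensional "embeddings" and the small vocabulary below are toy stand-ins for real sentence-transformer output, and a linear scan stands in for the HNSW index:

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy label embeddings: in the real pipeline these come from a
# sentence transformer applied to the vocabulary labels.
vocab = {
    "astronomy":  [0.9, 0.1, 0.0],
    "cooking":    [0.0, 0.2, 0.9],
    "telescopes": [0.8, 0.3, 0.1],
}

def candidates(chunk_embedding, top_k=2):
    # Inverted retrieval: the text chunk is the query, the vocabulary
    # is the searched collection (HNSW replaces this linear scan).
    scored = sorted(vocab.items(),
                    key=lambda kv: cosine(chunk_embedding, kv[1]),
                    reverse=True)
    return [label for label, _ in scored[:top_k]]

print(candidates([0.85, 0.2, 0.05]))  # a chunk "about" the night sky
```

Each chunk of a longer text would issue one such query, and the per-chunk candidate lists are then merged before ranking.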
+
+ This design borrows many ideas from lexical matching tools such as Maui [1], Kea [2], and particularly
+ [Annif's](https://annif.org/) implementation in the [MLLM backend](https://github.com/NatLibFi/Annif/wiki/Backend%3A-MLLM) (Maui-Like Lexical Matching).
+
+ [1] Medelyan, O., Frank, E., & Witten, I. H. (2009). Human-competitive tagging using automatic keyphrase extraction. ACL and AFNLP, 6–7. https://doi.org/10.5555/3454287.3454810
+
+ [2] Frank, E., Paynter, G. W., Witten, I. H., Gutwin, C., & Nevill-Manning, C. G. (1999). Domain-specific keyphrase extraction. Proceedings of the 16th International Joint Conference on Artificial Intelligence (IJCAI 99), 668–673.
+
+
+ ## Why embedding based matching
+
+ Existing subject indexing methods fall roughly into lexical matching algorithms and statistical learning algorithms. Lexical matching algorithms search a given input text for occurrences of subjects from the controlled vocabulary on the basis of their string representation. Statistical learning tries to learn patterns between input texts and gold-standard annotations from large training corpora.
+
+ Statistical learning can only predict subjects that occur in the gold-standard annotations used for training; it is incapable of zero-shot predictions. Lexical matching can find any subject that is part of the vocabulary. Unfortunately, lexical matching often produces a large number of false positives, as matching input texts and vocabulary solely on their string representation does not capture any semantic context. In particular, disambiguating subjects with similar string representations is a problem.
+
+ The idea of embedding based matching is to enhance lexical matching with the power of sentence transformer embeddings. These embeddings capture the semantic context of the input text and allow a vector-based matching that does not (solely) rely on the string representation.
+
+ Benefits of embedding based matching:
+
+ * strong zero-shot capabilities
+ * handling of synonyms and context
+
+ Disadvantages:
+
+ * creating embeddings for longer input texts with many chunks can be computationally expensive
+ * no generalization capabilities: statistical learning methods can learn the usage of a vocabulary
+ from large amounts of training data and thereby learn associations between patterns in input
+ texts and vocabulary items that go beyond lexical matching or embedding similarity.
+ Lexical matching and embedding based matching will always stay close to the text.
+
+ ## Ranker model
+
+ The ranker model adopts the idea, taken from lexical matching algorithms like MLLM or Maui, that subject candidates
+ can be ranked based on additional context information, e.g.
+
+ * `first_occurence`, `last_occurence`, `spread`: positions (chunk numbers) of the subject match in a text
+ * `occurences`: number of occurrences in a text
+ * `score`: sum of the similarity scores of all matches between a text chunk's embeddings and label embeddings
+ * `is_PrefLabelTRUE`: pref-label or alt-label tags in the SKOS vocabulary
+
+ These are numerical features that can be used to train a **binary** classifier. Given a
+ few hundred examples with gold-standard labels, the ranker is trained to
+ predict whether a suggested candidate label is indeed a match, based on the
+ numerical features collected during the matching process. In contrast to
+ the complex extreme multi-label classification problem, this is a much simpler
+ classification task, as the features the binary classifier
+ is trained on do not depend on the particular label.
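The feature extraction behind these names can be sketched as follows. The per-chunk match lists are hypothetical stand-ins for what the vector index would return; the feature names follow the list above (note they only make sense for labels that were matched at least once):

```python
# Hypothetical per-chunk match results: (label, similarity) pairs
# as returned by the vector index, one list per text chunk.
chunk_matches = [
    [("astronomy", 0.91), ("telescopes", 0.88)],
    [("astronomy", 0.85)],
    [("cooking", 0.40), ("astronomy", 0.79)],
]

def ranker_features(label):
    # Indices of the chunks in which this label was suggested.
    hits = [i for i, matches in enumerate(chunk_matches)
            if any(l == label for l, _ in matches)]
    # Sum of similarity scores over all matches of this label.
    score = sum(s for matches in chunk_matches
                for l, s in matches if l == label)
    return {
        "first_occurence": hits[0],
        "last_occurence": hits[-1],
        "spread": hits[-1] - hits[0],
        "occurences": len(hits),
        "score": score,
    }

print(ranker_features("astronomy"))
```

A binary classifier (xgboost in this project) is then trained on such feature vectors, with the gold-standard annotations providing the match/no-match target.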
+
+ Our ranker model is implemented using the [xgboost](https://xgboost.readthedocs.io/en/latest/index.html) library.
+
+ The following plot shows the variable importance of the xgboost ranker model:
+
+
+ ## Embedding model
+
+ Our code uses [Jina AI Embeddings](https://huggingface.co/jinaai/jina-embeddings-v3).
+ These implement a technique known as Matryoshka embeddings, which lets you
+ flexibly choose the dimension of your embedding vectors to find your own
+ cost-performance trade-off.
+
+ In this demo application we use asymmetric embeddings fine-tuned for retrieval:
+ embeddings of task `retrieval.query` for embedding the vocab and embeddings of task
+ `retrieval.passage` for embedding the text chunks.
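A minimal sketch of the Matryoshka idea: keep only a prefix of the embedding and re-normalize it. The four-dimensional vector is a toy; real model outputs are much higher-dimensional:

```python
import math

def truncate_matryoshka(embedding, dim):
    # Keep the first `dim` components and re-normalize to unit length.
    # Matryoshka-style models are trained so that these prefixes remain
    # usable embeddings at lower storage and search cost.
    head = embedding[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]

full = [0.6, 0.8, 0.0, 0.0]           # toy unit vector
small = truncate_matryoshka(full, 2)  # 2-dimensional prefix
print(small)
```

The chosen dimension must of course be the same for the vocabulary embeddings and the chunk embeddings, and it fixes the vector column size in the storage layer.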
+
+ ## Vector storage
+
+ This project uses [DuckDB](https://duckdb.org/) as storage for the vocabulary and the generated embeddings, together with its [Vector Similarity Search (VSS) extension](https://duckdb.org/docs/extensions/vss.html) for indexing and querying the embeddings.
+ Benefits of DuckDB are:
+
+ * it ships as a single-file database: no separate database server needed
+ * it implements vectorized HNSW search
+ * it allows parallel querying from multiple threads
+
+ In other words, DuckDB allows a parallelized, vectorized vector search, enabling
+ highly efficient subject retrieval even across large subject ontologies and
+ with large text corpora and longer documents.
+
+ The VSS extension allows some configuration of the HNSW index and of the distance metric (see the documentation for details). In this project, the 'cosine' distance and the corresponding 'array_cosine_distance' function are used. The metric and the function must be explicitly specified when creating and using the index, and they must match in order to work. To save the created index, the database configuration option 'hnsw_enable_experimental_persistence=true' must be set. DuckDB does not recommend this at the moment, but it should not be a problem for this project, as no further changes are expected once the collection has been created. Relevant and useful blog posts on the VSS extension can be found here:
+ - https://duckdb.org/2024/05/03/vector-similarity-search-vss.html
+ - https://duckdb.org/2024/10/23/whats-new-in-the-vss-extension.html
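The setup described above can be sketched in SQL. Table and column names are hypothetical, and the vector dimension and parameter placeholder depend on the embedding model and client library; what matters is that the index metric and the query's distance function both say 'cosine':

```sql
INSTALL vss;
LOAD vss;
-- Required to persist the HNSW index in the database file
-- (marked experimental by DuckDB).
SET hnsw_enable_experimental_persistence = true;

-- Hypothetical schema for label embeddings.
CREATE TABLE label_embeddings (
    label_id INTEGER,
    label    VARCHAR,
    emb      FLOAT[1024]
);

-- The metric here must match the distance function used at query time.
CREATE INDEX label_hnsw ON label_embeddings
USING HNSW (emb) WITH (metric = 'cosine');

-- Query: nearest vocabulary labels for one chunk embedding.
SELECT label
FROM label_embeddings
ORDER BY array_cosine_distance(emb, $chunk_embedding::FLOAT[1024])
LIMIT 50;
```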
+
+ ## Usage
+
+ The main entry point for the package is the class `ebm_model` and its methods.
@@ -0,0 +1,20 @@
+ # Minimal makefile for Sphinx documentation
+ #
+
+ # You can set these variables from the command line, and also
+ # from the environment for the first two.
+ SPHINXOPTS ?=
+ SPHINXBUILD ?= sphinx-build
+ SOURCEDIR = source
+ BUILDDIR = build
+
+ # Put it first so that "make" without argument is like "make help".
+ help:
+ @$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
+
+ .PHONY: help Makefile
+
+ # Catch-all target: route all unknown targets to Sphinx using the new
+ # "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS).
+ %: Makefile
+ @$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
@@ -0,0 +1,35 @@
+ @ECHO OFF
+
+ pushd %~dp0
+
+ REM Command file for Sphinx documentation
+
+ if "%SPHINXBUILD%" == "" (
+ set SPHINXBUILD=sphinx-build
+ )
+ set SOURCEDIR=source
+ set BUILDDIR=build
+
+ %SPHINXBUILD% >NUL 2>NUL
+ if errorlevel 9009 (
+ echo.
+ echo.The 'sphinx-build' command was not found. Make sure you have Sphinx
+ echo.installed, then set the SPHINXBUILD environment variable to point
+ echo.to the full path of the 'sphinx-build' executable. Alternatively you
+ echo.may add the Sphinx directory to PATH.
+ echo.
+ echo.If you don't have Sphinx installed, grab it from
+ echo.https://www.sphinx-doc.org/
+ exit /b 1
+ )
+
+ if "%1" == "" goto help
+
+ %SPHINXBUILD% -M %1 %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% %O%
+ goto end
+
+ :help
+ %SPHINXBUILD% -M help %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% %O%
+
+ :end
+ popd