glinker 0.1.0__py3-none-any.whl
- glinker/__init__.py +54 -0
- glinker/core/__init__.py +56 -0
- glinker/core/base.py +103 -0
- glinker/core/builders.py +547 -0
- glinker/core/dag.py +898 -0
- glinker/core/factory.py +261 -0
- glinker/core/registry.py +31 -0
- glinker/l0/__init__.py +21 -0
- glinker/l0/component.py +472 -0
- glinker/l0/models.py +90 -0
- glinker/l0/processor.py +108 -0
- glinker/l1/__init__.py +15 -0
- glinker/l1/component.py +284 -0
- glinker/l1/models.py +47 -0
- glinker/l1/processor.py +152 -0
- glinker/l2/__init__.py +19 -0
- glinker/l2/component.py +1220 -0
- glinker/l2/models.py +99 -0
- glinker/l2/processor.py +170 -0
- glinker/l3/__init__.py +12 -0
- glinker/l3/component.py +184 -0
- glinker/l3/models.py +48 -0
- glinker/l3/processor.py +350 -0
- glinker/l4/__init__.py +9 -0
- glinker/l4/component.py +121 -0
- glinker/l4/models.py +21 -0
- glinker/l4/processor.py +156 -0
- glinker/py.typed +1 -0
- glinker-0.1.0.dist-info/METADATA +994 -0
- glinker-0.1.0.dist-info/RECORD +33 -0
- glinker-0.1.0.dist-info/WHEEL +5 -0
- glinker-0.1.0.dist-info/licenses/LICENSE +201 -0
- glinker-0.1.0.dist-info/top_level.txt +1 -0
@@ -0,0 +1,994 @@
Metadata-Version: 2.4
Name: glinker
Version: 0.1.0
Summary: GLiNKER - A modular multi-layer entity linking framework
Author-email: Knowledgator <info@knowledgator.com>
License: Apache-2.0
Project-URL: Homepage, https://github.com/Knowledgator/GLinker
Project-URL: Repository, https://github.com/Knowledgator/GLinker
Project-URL: Documentation, https://github.com/Knowledgator/GLinker/blob/main/README.md
Keywords: entity-linking,nlp,gliner,spacy,ner
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: spacy>=3.7.0
Requires-Dist: gliner>=0.2.0
Requires-Dist: torch>=2.0.0
Requires-Dist: redis>=5.0.0
Requires-Dist: elasticsearch>=8.11.0
Requires-Dist: psycopg2-binary>=2.9.0
Requires-Dist: pydantic>=2.0.0
Requires-Dist: pyyaml>=6.0.0
Requires-Dist: tqdm>=4.65.0
Provides-Extra: dev
Requires-Dist: pytest>=7.4.0; extra == "dev"
Requires-Dist: pytest-cov>=4.1.0; extra == "dev"
Requires-Dist: black>=23.0.0; extra == "dev"
Requires-Dist: ruff>=0.1.0; extra == "dev"
Provides-Extra: demo
Requires-Dist: gradio>=4.0.0; extra == "demo"
Provides-Extra: all
Requires-Dist: glinker[demo,dev]; extra == "all"
Dynamic: license-file

# GLiNKER - Entity Linking Framework

<div align="center">
    <div>
        <a href="https://arxiv.org/abs/2406.12925"><img src="https://img.shields.io/badge/arXiv-2406.12925-b31b1b.svg" alt="GLiNER-bi-Encoder"></a>
        <a href="https://discord.gg/HbW9aNJ9"><img alt="Discord" src="https://img.shields.io/discord/1089800235347353640?logo=discord&logoColor=white&label=Discord&color=blue"></a>
        <a href="https://github.com/Knowledgator/GLinker/blob/main/LICENSE"><img alt="License" src="https://img.shields.io/github/license/Knowledgator/GLinker?color=blue"></a>
        <a href="https://hf.co/collections/knowledgator/gliner-linker"><img src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Models-yellow" alt="HuggingFace Models"></a>
        <a href="https://www.apache.org/licenses/LICENSE-2.0"><img src="https://img.shields.io/badge/License-Apache%202.0-blue.svg" alt="License: Apache 2.0"></a>
        <a href="https://pypi.org/project/glinker/"><img src="https://badge.fury.io/py/glinker.svg" alt="PyPI version"></a>
    </div>
    <br>
</div>

> A modular, production-ready entity linking framework combining NER, multi-layer database search, and neural entity disambiguation.

## Overview

GLiNKER is a modular entity linking pipeline that transforms raw text into structured, disambiguated entity mentions. It's designed for:

- **Production use**: Multi-layer caching (Redis → Elasticsearch → PostgreSQL)
- **Research flexibility**: Fully configurable YAML pipelines
- **Performance**: Embedding precomputation for BiEncoder models
- **Scalability**: DAG-based execution with batch processing

GLiNKER is built around GLiNER, a family of lightweight, generalist models for information extraction. GLiNER brings several key advantages to the entity linking pipeline:

- **Zero-shot recognition**: Identify any entity type by simply providing label names. No fine-tuning or annotated data required. Switch from biomedical genes to legal entities by changing a list of strings.
- **Unified architecture**: The same GLiNER model family handles both NER (L1) and entity disambiguation (L3/L4), reducing deployment complexity and keeping the inference stack consistent.
- **Efficient BiEncoder support**: BiEncoder variants allow precomputing label embeddings once and reusing them across millions of documents, delivering 10–100× speedups for large-scale linking.
- **Compact and fast**: Base models are small enough to run on CPU, while larger variants scale with GPU for production throughput.
- **Open and extensible**: Apache 2.0 licensed models on Hugging Face, easy to swap for domain-specific fine-tunes when needed.

### Traditional vs GLiNKER Approach

```python
# Traditional approach: Complex, coupled code
ner_results = spacy_model(text)
candidates = search_database(ner_results)
linked = gliner_model.disambiguate(candidates)
# Mix of models, databases, and business logic

# GLiNKER approach: Declarative configuration
from glinker import ConfigBuilder, DAGExecutor

builder = ConfigBuilder(name="biomedical_el")
builder.l1.gliner(model="knowledgator/gliner-bi-base-v2.0", labels=["gene", "protein", "disease"])
builder.l2.add("redis", priority=2).add("postgres", priority=0)
builder.l3.configure(model="knowledgator/gliner-linker-large-v1.0")

executor = DAGExecutor(builder.get_config())
result = executor.execute({"texts": ["CRISPR-Cas9 enables precise gene therapy"]})
```

## Table of Contents

- [Quick Start](#quick-start)
- [Creating Pipelines](#creating-pipelines)
  - [Option 1: create_simple (recommended start)](#option-1-create_simple-recommended-start)
  - [Option 2: From a YAML config file](#option-2-from-a-yaml-config-file)
  - [Option 3: ConfigBuilder (programmatic)](#option-3-configbuilder-programmatic)
- [Loading Entities](#loading-entities)
  - [From a JSONL file](#from-a-jsonl-file)
  - [From a Python list](#from-a-python-list)
  - [From a Python dict](#from-a-python-dict)
  - [Entity format reference](#entity-format-reference)
- [Architecture](#architecture)
- [Features](#features)
- [YAML Configuration Reference](#yaml-configuration-reference)
- [Advanced Features](#advanced-features)
- [Database Setup](#database-setup)
- [Testing](#testing)
- [Citations](#citations)

## Quick Start

### Installation

Install with pip:

```bash
pip install glinker
```

Or install from source:

```bash
git clone https://github.com/Knowledgator/GLinker.git
cd GLinker
pip install -e .

# With optional dependencies
pip install -e ".[dev,demo]"
```

### 30-Second Example

```python
from glinker import ConfigBuilder, DAGExecutor

# 1. Build configuration
builder = ConfigBuilder(name="demo")
builder.l1.spacy(model="en_core_web_sm")
builder.l3.configure(model="knowledgator/gliner-linker-large-v1.0")

# 2. Create executor
executor = DAGExecutor(builder.get_config())

# 3. Load entities
executor.load_entities("data/entities.jsonl", target_layers=["dict"])

# 4. Process text
result = executor.execute({
    "texts": ["Farnese Palace is one of the most important palaces in the city of Rome."]
})

# 5. Get results
l0_result = result.get("l0_result")
for entity in l0_result.entities:
    if entity.linked_entity:
        print(f"{entity.mention_text} → {entity.linked_entity.label}")
        print(f"  Confidence: {entity.linked_entity.score:.3f}")
```

**Example output** (illustrative, from a biomedical run; the actual mentions and scores depend on the input text and the loaded entities):
```
BRCA1 → BRCA1: Breast cancer type 1 susceptibility protein
  Confidence: 0.923
breast cancer → Breast Cancer: Malignant neoplasm of the breast
  Confidence: 0.887
```

---

## Creating Pipelines

GLiNKER offers three ways to create a pipeline, from simplest to most configurable.

### Option 1: `create_simple` (recommended start)

`ProcessorFactory.create_simple` builds an **L2 → L3 → L0** pipeline in one call. There is no NER step: the model links entities directly from the input text against all loaded entities.

```python
from glinker import ProcessorFactory

# Minimal: just a model name
executor = ProcessorFactory.create_simple(
    model_name="knowledgator/gliner-bi-base-v2.0",
    threshold=0.5,
)

# Load entities and run
executor.load_entities("data/entities.jsonl")
result = executor.execute({"texts": ["CRISPR-Cas9 enables precise gene therapy."]})
```

**With inline entities (no file needed):**

```python
executor = ProcessorFactory.create_simple(
    model_name="knowledgator/gliner-bi-base-v2.0",
    threshold=0.5,
    entities=[
        {"entity_id": "Q101", "label": "insulin", "description": "Peptide hormone regulating blood glucose"},
        {"entity_id": "Q102", "label": "glucose", "description": "Primary blood sugar and key metabolic fuel"},
        {"entity_id": "Q103", "label": "GLUT4", "description": "Insulin-responsive glucose transporter in muscle and adipose tissue"},
        {"entity_id": "Q104", "label": "pancreatic beta cell", "description": "Endocrine cell type that secretes insulin"},
    ],
)

result = executor.execute({
    "texts": [
        "After a meal, pancreatic beta cells release insulin, which promotes GLUT4 translocation and increases glucose uptake in muscle."
    ]
})
```

**With a reranker (L2 → L3 → L4 → L0):**

```python
executor = ProcessorFactory.create_simple(
    model_name="knowledgator/gliner-bi-base-v2.0",
    threshold=0.5,
    reranker_model="knowledgator/gliner-multitask-large-v0.5",
    reranker_max_labels=20,
    reranker_threshold=0.3,
    entities="data/entities.jsonl",
    precompute_embeddings=True,
)
```

**With entity descriptions in the template:**

```python
executor = ProcessorFactory.create_simple(
    model_name="knowledgator/gliner-bi-base-v2.0",
    template="{label}: {description}",  # L3 sees "BRCA1: Breast cancer type 1 susceptibility protein"
    entities="data/entities.jsonl",
)
```

**All `create_simple` parameters:**

| Parameter | Default | Description |
|-----------|---------|-------------|
| `model_name` | *(required)* | HuggingFace model ID or local path |
| `device` | `"cpu"` | Torch device (`"cpu"`, `"cuda"`, `"cuda:0"`) |
| `threshold` | `0.5` | Minimum score for entity predictions |
| `template` | `"{label}"` | Format string for entity labels (e.g. `"{label}: {description}"`) |
| `max_length` | `512` | Max sequence length for tokenization |
| `token` | `None` | HuggingFace auth token for gated models |
| `entities` | `None` | Entity data to load immediately (file path, list of dicts, or dict of dicts) |
| `precompute_embeddings` | `False` | Pre-embed all entity labels after loading (BiEncoder only) |
| `verbose` | `False` | Enable verbose logging |
| `reranker_model` | `None` | GLiNER model for L4 reranking (adds L4 node when set) |
| `reranker_max_labels` | `20` | Max candidate labels per L4 inference call |
| `reranker_threshold` | `None` | Score threshold for L4 (defaults to `threshold`) |
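
The `template` parameter is described as a format string over entity fields. A minimal sketch of how such a template could be rendered, assuming plain `str.format` semantics and empty strings for missing optional fields; `render_label` is a hypothetical helper for illustration, not part of the GLiNKER API:

```python
def render_label(template: str, entity: dict) -> str:
    """Fill a label template from an entity record (illustrative helper)."""
    # Assumption: optional fields that are absent render as empty strings.
    fields = {"label": "", "description": "", "entity_type": ""}
    fields.update(entity)
    return template.format(**fields)


entity = {
    "entity_id": "Q101",
    "label": "insulin",
    "description": "Peptide hormone regulating blood glucose",
}

print(render_label("{label}", entity))
# insulin
print(render_label("{label}: {description}", entity))
# insulin: Peptide hormone regulating blood glucose
```

A richer template gives the model more context per candidate, at the cost of longer label strings that count against `max_length`.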

### Option 2: From a YAML config file

For full control over every layer, define the pipeline in YAML and load it:

```python
from glinker import ProcessorFactory

executor = ProcessorFactory.create_pipeline("configs/pipelines/dict/simple.yaml")
executor.load_entities("data/entities.jsonl")
result = executor.execute({"texts": ["TP53 mutations cause cancer"]})
```

See [YAML Configuration Reference](#yaml-configuration-reference) for full config examples.

### Option 3: `ConfigBuilder` (programmatic)

Build configs in Python with full control over each layer:

```python
from glinker import ConfigBuilder, DAGExecutor

builder = ConfigBuilder(name="my_pipeline")
builder.l1.gliner(model="knowledgator/gliner-bi-base-v2.0", labels=["gene", "disease"])
builder.l3.configure(model="knowledgator/gliner-linker-large-v1.0")

executor = DAGExecutor(builder.get_config())
executor.load_entities("data/entities.jsonl", target_layers=["dict"])
```

**With multiple database layers:**

```python
builder = ConfigBuilder(name="production")
builder.l1.gliner(model="knowledgator/gliner-bi-base-v2.0", labels=["gene", "protein"])
builder.l2.add("redis", priority=2, ttl=3600)
builder.l2.add("elasticsearch", priority=1, ttl=86400)
builder.l2.add("postgres", priority=0)
builder.l3.configure(model="knowledgator/gliner-linker-large-v1.0", use_precomputed_embeddings=True)
builder.l0.configure(strict_matching=True, min_confidence=0.3)
builder.save("config.yaml")
```

**With L4 reranker:**

```python
builder = ConfigBuilder(name="reranked")
builder.l1.gliner(model="knowledgator/gliner-bi-base-v2.0", labels=["gene", "disease"])
builder.l3.configure(model="knowledgator/gliner-linker-base-v1.0")
builder.l4.configure(
    model="knowledgator/gliner-multitask-large-v0.5",
    threshold=0.3,
    max_labels=20,
)
builder.save("config.yaml")  # Generates L1 → L2 → L3 → L4 → L0
```

---

## Loading Entities

Entities can be loaded after pipeline creation via `executor.load_entities()`, or passed directly to `create_simple(entities=...)`. Three input formats are supported.

### From a JSONL file

One JSON object per line:

```python
executor.load_entities("data/entities.jsonl")

# Or target specific database layers
executor.load_entities("data/entities.jsonl", target_layers=["dict", "postgres"])
```

**`data/entities.jsonl`:**

```jsonl
{"entity_id": "Q123", "label": "Kyiv", "description": "Capital and largest city of Ukraine", "entity_type": "city", "popularity": 1000000, "aliases": ["Kiev"]}
{"entity_id": "Q456", "label": "Dnipro River", "description": "Major river flowing through Ukraine and Belarus", "entity_type": "river", "popularity": 950000, "aliases": ["Dnieper"]}
{"entity_id": "Q789", "label": "Carpathian Mountains", "description": "Mountain range in Central and Eastern Europe", "entity_type": "mountain_range", "popularity": 800000, "aliases": ["Carpathians"]}
```
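
A file in this shape can be produced with the standard library alone. The sketch below writes and re-reads a JSONL file; `write_jsonl` and `read_jsonl` are hypothetical helper names used for illustration, and the path is a temporary file rather than `data/entities.jsonl`:

```python
import json
import tempfile
from pathlib import Path


def write_jsonl(path, records):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Read a JSONL file back into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


entities = [
    {"entity_id": "Q123", "label": "Kyiv", "entity_type": "city", "aliases": ["Kiev"]},
    {"entity_id": "Q456", "label": "Dnipro River", "entity_type": "river", "aliases": ["Dnieper"]},
]

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "entities.jsonl"
    write_jsonl(path, entities)
    loaded = read_jsonl(path)

print(loaded == entities)  # True
```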

### From a Python list

```python
entities = [
    {
        "entity_id": "Q123",
        "label": "Kyiv",
        "description": "Capital and largest city of Ukraine",
        "entity_type": "city",
        "aliases": ["Kiev"],
    },
    {
        "entity_id": "Q456",
        "label": "Dnipro River",
        "description": "Major river flowing through Ukraine and Belarus",
        "entity_type": "river",
        "aliases": ["Dnieper"],
    },
]

executor.load_entities(entities)
```

### From a Python dict

Keys are entity IDs, values are entity data:

```python
entities = {
    "Q123": {
        "label": "Kyiv",
        "description": "Capital and largest city of Ukraine",
        "entity_type": "city",
    },
    "Q456": {
        "label": "Dnipro River",
        "description": "Major river flowing through Ukraine and Belarus",
        "entity_type": "river",
    },
}

executor.load_entities(entities)
```

### Entity format reference

| Field | Type | Required | Default | Description |
|-------|------|----------|---------|-------------|
| `entity_id` | str | **yes** | n/a | Unique identifier |
| `label` | str | **yes** | n/a | Primary name |
| `description` | str | no | `""` | Text description (used in templates like `"{label}: {description}"`) |
| `entity_type` | str | no | `""` | Category (e.g. `"gene"`, `"disease"`) |
| `aliases` | list[str] | no | `[]` | Alternative names for search matching |
| `popularity` | int | no | `0` | Ranking score for candidate ordering |
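
To make the list and dict forms and the defaults above concrete, here is a minimal sketch that normalizes either form into complete records. `normalize_entities` is a hypothetical helper written for this example; how GLiNKER validates entities internally is not shown in this README:

```python
from copy import deepcopy

REQUIRED = ("entity_id", "label")
DEFAULTS = {"description": "", "entity_type": "", "aliases": [], "popularity": 0}


def normalize_entities(data):
    """Accept a list of dicts or a dict keyed by entity ID; return complete records."""
    if isinstance(data, dict):
        # Dict form: the key supplies the entity_id.
        records = [{"entity_id": entity_id, **fields} for entity_id, fields in data.items()]
    else:
        records = [dict(record) for record in data]

    normalized = []
    for record in records:
        missing = [name for name in REQUIRED if not record.get(name)]
        if missing:
            raise ValueError(f"entity {record!r} is missing required field(s): {missing}")
        # Copy the defaults so records never share the same aliases list.
        normalized.append({**deepcopy(DEFAULTS), **record})
    return normalized


by_id = {"Q123": {"label": "Kyiv", "entity_type": "city"}}
print(normalize_entities(by_id))
```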

---

## Architecture

GLiNKER uses a **layered pipeline** with an optional reranking stage:

| Layer | Purpose | Processor |
|-------|---------|-----------|
| **L1** | Mention extraction (spaCy or GLiNER NER) | `l1_spacy`, `l1_gliner` |
| **L2** | Candidate retrieval from database layers | `l2_chain` |
| **L3** | Entity disambiguation via GLiNER | `l3_batch` |
| **L4** | *(Optional)* GLiNER reranking with candidate chunking | `l4_reranker` |
| **L0** | Aggregation, filtering, and final output | `l0_aggregator` |

**Supported topologies:**
```
Full pipeline:       L1 → L2 → L3 → L0
With reranking:      L1 → L2 → L3 → L4 → L0
Simple (no NER):     L2 → L3 → L0
Simple + reranker:   L2 → L3 → L4 → L0
Reranker only:       L2 → L4 → L0
```

**Key Concepts:**

- **DAG Execution**: Layers execute in dependency order with automatic data flow
- **Component-Processor Pattern**: Each layer has a Component (methods) and Processor (orchestration)
- **Schema Consistency**: Single template (e.g., `"{label}: {description}"`) across layers
- **Cache Hierarchy**: Upper layers cache results from lower layers automatically
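
Dependency-ordered execution can be illustrated with the standard library. The sketch below derives a run order from each node's `requires` list using `graphlib.TopologicalSorter`; it shows the general technique only and is not GLiNKER's actual `DAGExecutor` implementation. `execution_order` is a hypothetical helper:

```python
from graphlib import TopologicalSorter


def execution_order(nodes):
    """Return node IDs in an order that respects every `requires` dependency."""
    graph = {node["id"]: set(node.get("requires", [])) for node in nodes}
    unknown = {dep for deps in graph.values() for dep in deps} - graph.keys()
    if unknown:
        raise ValueError(f"requires references undefined node(s): {sorted(unknown)}")
    # TopologicalSorter takes a mapping of node -> predecessors.
    return list(TopologicalSorter(graph).static_order())


nodes = [
    {"id": "l0", "requires": ["l1", "l2", "l3"]},
    {"id": "l3", "requires": ["l2"]},
    {"id": "l2", "requires": ["l1"]},
    {"id": "l1"},
]

print(execution_order(nodes))  # ['l1', 'l2', 'l3', 'l0']
```

Because the order comes from the dependency graph, the nodes may be declared in any order in the config.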

---

## Features

### Multiple NER Backends
- **spaCy**: Fast NER with pretrained statistical pipelines for standard use cases
- **GLiNER**: Neural NER with custom labels (no training required)

### Multi-Layer Database Support
- **Dict**: In-memory (perfect for demos)
- **Redis**: Fast cache (production)
- **Elasticsearch**: Full-text search with fuzzy matching
- **PostgreSQL**: Persistent storage with pg_trgm fuzzy search

### Performance Optimization
- **Embedding Precomputation**: Cache label embeddings for BiEncoder models
- **Cache Hierarchy**: Automatic write-back: Redis → ES → PostgreSQL
- **Batch Processing**: Efficient parallel processing
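
The cache hierarchy can be pictured as a prioritized chain of stores. The sketch below is a toy model that assumes higher `priority` layers are queried first and that a hit in a slower layer is written back to the faster layers that missed; `DictLayer` and `lookup` are hypothetical names, and the real L2 behavior may differ:

```python
class DictLayer:
    """In-memory stand-in for a storage layer (Redis, Elasticsearch, PostgreSQL)."""

    def __init__(self, name, priority, data=None):
        self.name = name
        self.priority = priority
        self.data = dict(data or {})

    def get(self, key):
        return self.data.get(key)

    def put(self, key, value):
        self.data[key] = value


def lookup(layers, key):
    """Query layers from highest to lowest priority, writing hits back upward."""
    missed = []
    for layer in sorted(layers, key=lambda item: item.priority, reverse=True):
        value = layer.get(key)
        if value is not None:
            for faster in missed:  # assumption: write-back on a lower-layer hit
                faster.put(key, value)
            return value, layer.name
        missed.append(layer)
    return None, None


redis = DictLayer("redis", priority=2)
es = DictLayer("elasticsearch", priority=1)
postgres = DictLayer("postgres", priority=0, data={"tp53": ["Q1", "Q2"]})
layers = [postgres, redis, es]

print(lookup(layers, "tp53"))  # (['Q1', 'Q2'], 'postgres')
print(lookup(layers, "tp53"))  # (['Q1', 'Q2'], 'redis')
```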

### L4 Reranker (Optional)

When the candidate set from L2 is large (tens or hundreds of entities), a single GLiNER call may be impractical. The **L4 reranker** solves this by splitting candidates into chunks:

```
100 candidates, max_labels=20  →  5 GLiNER inference calls
Results merged, deduplicated, filtered by threshold
```
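
The chunk, merge, deduplicate, and filter steps can be sketched in plain Python. In the example below the model call is replaced by a toy scoring function (`toy_score_chunk`), and `rerank` is a hypothetical helper; both are assumptions made for illustration, not GLiNKER's L4 code:

```python
def chunked(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def rerank(text, candidates, score_chunk, max_labels=20, threshold=0.3):
    """Score candidates chunk by chunk, keep the best score per label, filter, sort."""
    best = {}
    calls = 0
    for chunk in chunked(candidates, max_labels):
        calls += 1  # one inference call per chunk
        for label, score in score_chunk(text, chunk):
            if score >= threshold and score > best.get(label, float("-inf")):
                best[label] = score  # deduplicate by keeping the highest score
    ranked = sorted(best.items(), key=lambda pair: pair[1], reverse=True)
    return ranked, calls


def toy_score_chunk(text, labels):
    """Stand-in for a model call: 1.0 if the label occurs in the text, else 0.0."""
    return [(label, 1.0 if label.lower() in text.lower() else 0.0) for label in labels]


candidates = [f"entity_{i}" for i in range(98)] + ["insulin", "insulin"]
ranked, calls = rerank("Beta cells release insulin.", candidates, toy_score_chunk)
print(calls)   # 5
print(ranked)  # [('insulin', 1.0)]
```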

L4 uses a **uni-encoder GLiNER model** and can be placed after L3 (true reranking) or used directly after L2 (replacing L3):

```python
# Via ConfigBuilder
builder.l4.configure(
    model="knowledgator/gliner-multitask-large-v0.5",
    threshold=0.3,
    max_labels=20  # candidates per inference call
)

# Via create_simple
executor = ProcessorFactory.create_simple(
    model_name="knowledgator/gliner-bi-base-v2.0",
    reranker_model="knowledgator/gliner-multitask-large-v0.5",
    reranker_max_labels=20,
)
```

---

## YAML Configuration Reference

YAML configs give full control over every node in the pipeline. Load them with:

```python
from glinker import ProcessorFactory

executor = ProcessorFactory.create_pipeline("path/to/config.yaml")
```

### Simple pipeline (L2 → L3 → L0, no NER)

Equivalent to `create_simple`. There is no L1 node, so texts are passed directly to L2/L3:

```yaml
name: "simple"
description: "Simple pipeline - L3 only with entity database"

nodes:
  - id: "l2"
    processor: "l2_chain"
    inputs:
      texts:
        source: "$input"
        fields: "texts"
    output:
      key: "l2_result"
    schema:
      template: "{label}"
    config:
      max_candidates: 30
      min_popularity: 0
      layers:
        - type: "dict"
          priority: 0
          write: true
          search_mode: ["exact"]

  - id: "l3"
    processor: "l3_batch"
    requires: ["l2"]
    inputs:
      texts:
        source: "$input"
        fields: "texts"
      candidates:
        source: "l2_result"
        fields: "candidates"
    output:
      key: "l3_result"
    schema:
      template: "{label}"
    config:
      model_name: "knowledgator/gliner-bi-base-v2.0"
      device: "cpu"
      threshold: 0.5
      flat_ner: true
      multi_label: false
      use_precomputed_embeddings: true
      cache_embeddings: false
      max_length: 512

  - id: "l0"
    processor: "l0_aggregator"
    requires: ["l2", "l3"]
    inputs:
      l2_candidates:
        source: "l2_result"
        fields: "candidates"
      l3_entities:
        source: "l3_result"
        fields: "entities"
    output:
      key: "l0_result"
    config:
      strict_matching: false
      min_confidence: 0.0
      include_unlinked: true
      position_tolerance: 2
```
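
Each node's `inputs` block names a `source` (either `"$input"` or another node's output `key`) and the `fields` to read from it. The sketch below shows one plausible way such a block could be resolved. It assumes results are plain dicts keyed by output key, which is a simplification, and `resolve_inputs` is a hypothetical helper rather than a GLiNKER function:

```python
def resolve_inputs(inputs_spec, pipeline_input, results):
    """Map each declared input name to the value it points at."""
    resolved = {}
    for name, spec in inputs_spec.items():
        if spec["source"] == "$input":
            source = pipeline_input           # the dict passed to execute()
        else:
            source = results[spec["source"]]  # an earlier node's output
        resolved[name] = source[spec["fields"]]
    return resolved


# Mirrors the `inputs` block of the "l3" node.
l3_inputs = {
    "texts": {"source": "$input", "fields": "texts"},
    "candidates": {"source": "l2_result", "fields": "candidates"},
}
pipeline_input = {"texts": ["TP53 mutations cause cancer"]}
results = {"l2_result": {"candidates": [["TP53"]]}}

print(resolve_inputs(l3_inputs, pipeline_input, results))
```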

### Full pipeline with spaCy NER (L1 → L2 → L3 → L0)

```yaml
name: "dict_default"
description: "In-memory dict layer with spaCy NER"

nodes:
  - id: "l1"
    processor: "l1_spacy"
    inputs:
      texts:
        source: "$input"
        fields: "texts"
    output:
      key: "l1_result"
    config:
      model: "en_core_sci_sm"
      device: "cpu"
      batch_size: 1
      min_entity_length: 2
      include_noun_chunks: true

  - id: "l2"
    processor: "l2_chain"
    requires: ["l1"]
    inputs:
      mentions:
        source: "l1_result"
        fields: "entities"
    output:
      key: "l2_result"
    schema:
      template: "{label}: {description}"
    config:
      max_candidates: 5
      layers:
        - type: "dict"
          priority: 0
          write: true
          search_mode: ["exact", "fuzzy"]
          fuzzy:
            max_distance: 64
            min_similarity: 0.6

  - id: "l3"
    processor: "l3_batch"
    requires: ["l2"]
    inputs:
      texts:
        source: "$input"
        fields: "texts"
      candidates:
        source: "l2_result"
        fields: "candidates"
    output:
      key: "l3_result"
    schema:
      template: "{label}: {description}"
    config:
      model_name: "knowledgator/gliner-linker-large-v1.0"
      device: "cpu"
      threshold: 0.5
      flat_ner: true
      multi_label: false
      max_length: 512

  - id: "l0"
    processor: "l0_aggregator"
    requires: ["l1", "l2", "l3"]
    inputs:
      l1_entities:
        source: "l1_result"
        fields: "entities"
      l2_candidates:
        source: "l2_result"
        fields: "candidates"
      l3_entities:
        source: "l3_result"
        fields: "entities"
    output:
      key: "l0_result"
    config:
      strict_matching: true
      min_confidence: 0.0
      include_unlinked: true
      position_tolerance: 2
```
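
The `fuzzy` settings above put a similarity floor on candidate matches. As a rough illustration of threshold-based fuzzy retrieval, the sketch below uses `difflib.SequenceMatcher` from the standard library as a stand-in similarity measure. This is an assumption made for the example: the similarity function each GLiNKER layer actually uses (for instance pg_trgm in PostgreSQL) will score strings differently. `fuzzy_candidates` is a hypothetical helper:

```python
from difflib import SequenceMatcher


def fuzzy_candidates(mention, names, min_similarity=0.6, max_candidates=5):
    """Return (name, similarity) pairs at or above the floor, best first."""
    scored = [
        (name, SequenceMatcher(None, mention.lower(), name.lower()).ratio())
        for name in names
    ]
    kept = [pair for pair in scored if pair[1] >= min_similarity]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:max_candidates]


names = ["Kyiv", "Dnipro River", "Carpathian Mountains"]
for name, similarity in fuzzy_candidates("Kiev", names):
    print(f"{name}: {similarity:.2f}")
```

Raising `min_similarity` trades recall for precision, and `max_candidates` caps how many labels reach the disambiguation step.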

### Pipeline with L4 reranker (L1 → L2 → L3 → L4 → L0)

Use when the candidate set is large. L4 splits candidates into chunks of `max_labels` and runs GLiNER inference on each chunk:

```yaml
name: "dict_reranker"
description: "In-memory dict with L4 GLiNER reranking"

nodes:
  - id: "l1"
    processor: "l1_gliner"
    inputs:
      texts:
        source: "$input"
        fields: "texts"
    output:
      key: "l1_result"
    config:
      model: "knowledgator/gliner-bi-base-v2.0"
      labels: ["gene", "drug", "disease", "person", "organization"]
      device: "cpu"

  - id: "l2"
    processor: "l2_chain"
    requires: ["l1"]
    inputs:
      mentions:
        source: "l1_result"
        fields: "entities"
    output:
      key: "l2_result"
    schema:
      template: "{label}: {description}"
    config:
      max_candidates: 100
      layers:
        - type: "dict"
          priority: 0
          write: true
          search_mode: ["exact", "fuzzy"]

  - id: "l3"
    processor: "l3_batch"
    requires: ["l1", "l2"]
    inputs:
      texts:
        source: "$input"
        fields: "texts"
      candidates:
        source: "l2_result"
        fields: "candidates"
    output:
      key: "l3_result"
    schema:
      template: "{label}: {description}"
    config:
      model_name: "knowledgator/gliner-linker-base-v1.0"
      device: "cpu"
      threshold: 0.5
      use_precomputed_embeddings: true

  - id: "l4"
    processor: "l4_reranker"
    requires: ["l1", "l2", "l3"]
    inputs:
      texts:
        source: "$input"
        fields: "texts"
      candidates:
        source: "l2_result"
        fields: "candidates"
      l1_entities:
        source: "l1_result"
        fields: "entities"
    output:
      key: "l4_result"
    schema:
      template: "{label}: {description}"
    config:
      model_name: "knowledgator/gliner-multitask-large-v0.5"
      device: "cpu"
      threshold: 0.3
      max_labels: 20  # candidates per inference call

  - id: "l0"
    processor: "l0_aggregator"
    requires: ["l1", "l2", "l4"]
    inputs:
      l1_entities:
        source: "l1_result"
        fields: "entities"
      l2_candidates:
        source: "l2_result"
        fields: "candidates"
      l3_entities:
        source: "l4_result"  # L0 reads from L4 instead of L3
        fields: "entities"
    output:
      key: "l0_result"
    config:
      strict_matching: true
      min_confidence: 0.0
      include_unlinked: true
```

### Simple pipeline with reranker only (L2 → L4 → L0, no L1/L3)

Skips both NER and L3; L4 handles entity linking directly with chunked inference:

```yaml
name: "simple_reranker"
description: "Simple pipeline with L4 reranker - no L1 or L3"

nodes:
  - id: "l2"
    processor: "l2_chain"
    inputs:
      texts:
        source: "$input"
        fields: "texts"
    output:
      key: "l2_result"
    schema:
      template: "{label}: {description}"
    config:
      max_candidates: 100
      layers:
        - type: "dict"
          priority: 0
          write: true
          search_mode: ["exact"]

  - id: "l4"
    processor: "l4_reranker"
    requires: ["l2"]
    inputs:
      texts:
        source: "$input"
        fields: "texts"
      candidates:
        source: "l2_result"
        fields: "candidates"
    output:
      key: "l4_result"
    schema:
      template: "{label}: {description}"
    config:
      model_name: "knowledgator/gliner-multitask-large-v0.5"
      device: "cpu"
      threshold: 0.5
      max_labels: 20

  - id: "l0"
    processor: "l0_aggregator"
    requires: ["l2", "l4"]
    inputs:
      l2_candidates:
        source: "l2_result"
        fields: "candidates"
      l3_entities:
        source: "l4_result"
        fields: "entities"
    output:
      key: "l0_result"
    config:
      strict_matching: false
      min_confidence: 0.0
      include_unlinked: true
```
|
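The `max_labels` setting caps how many candidates go into one inference call. The following is a rough sketch of that chunking idea, assuming candidates are scored in fixed-size groups and the per-chunk results are merged afterwards. `rerank`, `chunked`, and `toy_scorer` are hypothetical names, and the scorer is a stand-in for a real model call, not the GLinker or GLiNER API.

```python
from typing import Callable, Dict, Iterator, List, Tuple


def chunked(items: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `size` elements."""
    if size < 1:
        raise ValueError("size must be >= 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def rerank(
    text: str,
    candidates: List[str],
    score_chunk: Callable[[str, List[str]], Dict[str, float]],
    max_labels: int = 20,
    threshold: float = 0.5,
) -> List[Tuple[str, float]]:
    """Score candidates chunk by chunk and keep those at or above threshold."""
    kept: Dict[str, float] = {}
    for chunk in chunked(candidates, max_labels):
        for label, score in score_chunk(text, chunk).items():
            if score >= threshold:
                kept[label] = max(score, kept.get(label, 0.0))
    return sorted(kept.items(), key=lambda item: item[1], reverse=True)


def toy_scorer(text: str, labels: List[str]) -> Dict[str, float]:
    """Hypothetical scorer: 1.0 if the label's name occurs in the text."""
    lowered = text.lower()
    return {
        label: 1.0 if label.split(":")[0].lower() in lowered else 0.0
        for label in labels
    }


candidates = [f"Entity {i}: placeholder" for i in range(44)]
candidates.append("Paris: capital of France")
print(rerank("She moved to Paris last year.", candidates, toy_scorer))
```

With 45 candidates and `max_labels=20`, the scorer is called three times (20, 20, and 5 labels).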
|
818
|
+
|
|
819
|
+
### Production config with multiple database layers
|
|
820
|
+
|
|
821
|
+
```yaml
|
|
822
|
+
name: "production_pipeline"
|
|
823
|
+
|
|
824
|
+
nodes:
|
|
825
|
+
- id: "l2"
|
|
826
|
+
processor: "l2_chain"
|
|
827
|
+
config:
|
|
828
|
+
layers:
|
|
829
|
+
- type: "redis"
|
|
830
|
+
priority: 2
|
|
831
|
+
ttl: 3600
|
|
832
|
+
- type: "elasticsearch"
|
|
833
|
+
priority: 1
|
|
834
|
+
ttl: 86400
|
|
835
|
+
- type: "postgres"
|
|
836
|
+
priority: 0
|
|
837
|
+
```
|
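One way to picture a layered lookup like the one configured above is a cache fall-through. The sketch below rests on two assumptions that this excerpt does not confirm: layers are queried from the highest `priority` to the lowest, and a hit in a slower layer is copied back into the faster ones. `Layer` and `lookup` are hypothetical names, in-memory dicts stand in for the real databases, and `ttl` expiry is not modeled.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Layer:
    name: str
    priority: int
    store: Dict[str, str] = field(default_factory=dict)


def lookup(layers: List[Layer], key: str) -> Optional[str]:
    """Query layers in descending priority and back-fill on a hit."""
    ordered = sorted(layers, key=lambda layer: layer.priority, reverse=True)
    for index, layer in enumerate(ordered):
        value = layer.store.get(key)
        if value is not None:
            # ASSUMPTION: copy the hit into every layer queried before this one.
            for faster in ordered[:index]:
                faster.store[key] = value
            return value
    return None


redis = Layer("redis", priority=2)
elastic = Layer("elasticsearch", priority=1)
postgres = Layer("postgres", priority=0, store={"Q90": "Paris: capital of France"})

print(lookup([postgres, redis, elastic], "Q90"))
print("Q90" in redis.store)
```

After the first lookup, the entry is also present in the two faster stand-in layers, so a repeated query would stop at the first layer.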
|
838
|
+
|
|
839
|
+
---
|
|
840
|
+
|
|
841
|
+
## Use Cases
|
|
842
|
+
|
|
843
|
+
### Biomedical Text Mining
|
|
844
|
+
```python
|
|
845
|
+
builder.l1.gliner(
|
|
846
|
+
model="knowledgator/gliner-bi-base-v2.0",
|
|
847
|
+
labels=["gene", "protein", "disease", "drug", "chemical"]
|
|
848
|
+
)
|
|
849
|
+
```
|
|
850
|
+
|
|
851
|
+
### News Article Analysis
|
|
852
|
+
```python
|
|
853
|
+
builder.l1.spacy(model="en_core_web_lg")
|
|
854
|
+
# Link to Wikidata/Wikipedia entities
|
|
855
|
+
```
|
|
856
|
+
|
|
857
|
+
### Clinical NLP
|
|
858
|
+
```python
|
|
859
|
+
builder.l1.gliner(
|
|
860
|
+
model="knowledgator/gliner-bi-base-v2.0",
|
|
861
|
+
labels=["symptom", "diagnosis", "medication", "procedure"]
|
|
862
|
+
)
|
|
863
|
+
```
|
|
864
|
+
|
|
865
|
+
---
|
|
866
|
+
|
|
867
|
+
## Advanced Features
|
|
868
|
+
|
|
869
|
+
### Precomputed Embeddings (BiEncoder)
|
|
870
|
+
|
|
871
|
+
For BiEncoder models, label embeddings do not depend on the input text, so precomputing them once can give 10–100× speedups at inference:
|
|
872
|
+
|
|
873
|
+
```python
|
|
874
|
+
# Load entities, then precompute
|
|
875
|
+
executor.load_entities("data/entities.jsonl")
|
|
876
|
+
executor.precompute_embeddings(batch_size=64)
|
|
877
|
+
|
|
878
|
+
# Or do both in create_simple
|
|
879
|
+
executor = ProcessorFactory.create_simple(
|
|
880
|
+
model_name="knowledgator/gliner-bi-base-v2.0",
|
|
881
|
+
entities="data/entities.jsonl",
|
|
882
|
+
precompute_embeddings=True,
|
|
883
|
+
)
|
|
884
|
+
```
|
|
885
|
+
|
|
886
|
+
### On-the-Fly Embedding Caching
|
|
887
|
+
|
|
888
|
+
Instead of precomputing all embeddings upfront, cache them as they are computed during inference:
|
|
889
|
+
|
|
890
|
+
```python
|
|
891
|
+
builder.l3.configure(
|
|
892
|
+
model="knowledgator/gliner-linker-large-v1.0",
|
|
893
|
+
cache_embeddings=True,
|
|
894
|
+
)
|
|
895
|
+
```
|
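Conceptually, on-the-fly caching is memoization of the label encoder. The sketch below shows that idea under the assumption that embeddings are keyed by label text. `EmbeddingCache` is a hypothetical class for illustration, not the mechanism behind `cache_embeddings`.

```python
from typing import Callable, Dict, List


class EmbeddingCache:
    """Compute an embedding on first use and reuse it afterwards."""

    def __init__(self, encoder: Callable[[str], List[float]]) -> None:
        self._encoder = encoder
        self._cache: Dict[str, List[float]] = {}
        self.misses = 0  # number of times the encoder actually ran

    def get(self, label: str) -> List[float]:
        if label not in self._cache:
            self.misses += 1
            self._cache[label] = self._encoder(label)
        return self._cache[label]


# Toy encoder standing in for a model call.
cache = EmbeddingCache(lambda text: [float(len(text))])
for label in ["gene", "drug", "gene", "gene"]:
    cache.get(label)
print(cache.misses)  # 2
```

Compared with precomputing everything upfront, this pays the encoding cost only for labels that actually appear as candidates.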
|
896
|
+
|
|
897
|
+
### Custom Pipelines
|
|
898
|
+
|
|
899
|
+
```python
|
|
900
|
+
# Custom L1 processing pipeline
|
|
901
|
+
l1_processor = processor_registry.get("l1_spacy")(
|
|
902
|
+
config_dict={"model": "en_core_sci_sm"},
|
|
903
|
+
pipeline=[
|
|
904
|
+
("extract_entities", {}),
|
|
905
|
+
("filter_by_length", {"min_length": 3}),
|
|
906
|
+
("deduplicate", {}),
|
|
907
|
+
("sort_by_position", {})
|
|
908
|
+
]
|
|
909
|
+
)
|
|
910
|
+
```
|
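The `pipeline` argument above is a list of `(step_name, kwargs)` pairs. As a rough sketch of how such a list could be applied in order, the code below implements three of the step names as plain functions over already-extracted entities. The entity shape (`text`, `start`, `end`), the `STEPS` table, and `run_pipeline` are assumptions for illustration, not GLinker internals.

```python
from typing import Any, Callable, Dict, List, Tuple

Entity = Dict[str, Any]  # assumed shape: {"text": str, "start": int, "end": int}


def filter_by_length(entities: List[Entity], min_length: int = 1) -> List[Entity]:
    return [e for e in entities if len(e["text"]) >= min_length]


def deduplicate(entities: List[Entity]) -> List[Entity]:
    seen = set()
    unique = []
    for entity in entities:
        key = (entity["text"], entity["start"], entity["end"])
        if key not in seen:
            seen.add(key)
            unique.append(entity)
    return unique


def sort_by_position(entities: List[Entity]) -> List[Entity]:
    return sorted(entities, key=lambda e: e["start"])


STEPS: Dict[str, Callable[..., List[Entity]]] = {
    "filter_by_length": filter_by_length,
    "deduplicate": deduplicate,
    "sort_by_position": sort_by_position,
}


def run_pipeline(entities: List[Entity], pipeline: List[Tuple[str, dict]]) -> List[Entity]:
    """Apply each named step in order, passing its kwargs."""
    for name, kwargs in pipeline:
        entities = STEPS[name](entities, **kwargs)
    return entities


raw = [
    {"text": "of", "start": 10, "end": 12},
    {"text": "aspirin", "start": 20, "end": 27},
    {"text": "BRCA1", "start": 0, "end": 5},
    {"text": "aspirin", "start": 20, "end": 27},
]
steps = [("filter_by_length", {"min_length": 3}), ("deduplicate", {}), ("sort_by_position", {})]
print([e["text"] for e in run_pipeline(raw, steps)])  # ['BRCA1', 'aspirin']
```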
|
911
|
+
|
|
912
|
+
---
|
|
913
|
+
|
|
914
|
+
## Database Setup
|
|
915
|
+
|
|
916
|
+
### Quick Start (Docker)
|
|
917
|
+
```bash
|
|
918
|
+
# Start all databases
|
|
919
|
+
cd scripts/database
|
|
920
|
+
docker-compose up -d
|
|
921
|
+
|
|
922
|
+
# Load entities
|
|
923
|
+
bash setup_all.sh
|
|
924
|
+
```
|
|
925
|
+
|
|
926
|
+
### Manual Setup
|
|
927
|
+
```python
|
|
928
|
+
from glinker import DAGExecutor
|
|
929
|
+
|
|
930
|
+
executor = DAGExecutor(pipeline)  # `pipeline`: the pipeline configuration defined earlier
|
|
931
|
+
executor.load_entities(
|
|
932
|
+
filepath="data/entities.jsonl",
|
|
933
|
+
target_layers=["redis", "elasticsearch", "postgres"],
|
|
934
|
+
batch_size=1000
|
|
935
|
+
)
|
|
936
|
+
```
|
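To make the `batch_size` parameter concrete, here is a minimal sketch of reading a JSONL file in fixed-size batches before writing each batch to a store. `read_jsonl_batches` is a hypothetical helper, not part of GLinker. The `label` and `description` fields follow the `{label}: {description}` template used in the configs, while `entity_id` is an assumed field name.

```python
import io
import json
from typing import Any, Dict, Iterable, Iterator, List


def read_jsonl_batches(
    lines: Iterable[str], batch_size: int = 1000
) -> Iterator[List[Dict[str, Any]]]:
    """Parse one JSON object per line and yield lists of up to batch_size records."""
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    batch: List[Dict[str, Any]] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        batch.append(json.loads(line))
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # final, possibly smaller batch


# In-memory stand-in for open("data/entities.jsonl").
sample = io.StringIO(
    '{"entity_id": "Q1", "label": "Paris", "description": "capital of France"}\n'
    '{"entity_id": "Q2", "label": "Berlin", "description": "capital of Germany"}\n'
    "\n"
    '{"entity_id": "Q3", "label": "Rome", "description": "capital of Italy"}\n'
)
for batch in read_jsonl_batches(sample, batch_size=2):
    print([record["label"] for record in batch])
```

Batching keeps memory use bounded and lets each database layer receive bulk writes instead of one request per entity.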
|
937
|
+
|
|
938
|
+
---
|
|
939
|
+
|
|
940
|
+
## Testing
|
|
941
|
+
|
|
942
|
+
```bash
|
|
943
|
+
# Run all tests
|
|
944
|
+
pytest
|
|
945
|
+
|
|
946
|
+
# Run specific layer tests
|
|
947
|
+
pytest tests/l1/
|
|
948
|
+
pytest tests/l2/
|
|
949
|
+
|
|
950
|
+
# Run with coverage
|
|
951
|
+
pytest --cov=glinker --cov-report=html
|
|
952
|
+
```
|
|
953
|
+
|
|
954
|
+
## Citations
|
|
955
|
+
|
|
956
|
+
If you find GLiNKER useful in your research, please consider citing our paper:
|
|
957
|
+
|
|
958
|
+
```bibtex
|
|
959
|
+
@misc{stepanov2024glinermultitaskgeneralistlightweight,
|
|
960
|
+
title={GLiNER multi-task: Generalist Lightweight Model for Various Information Extraction Tasks},
|
|
961
|
+
author={Ihor Stepanov and Mykhailo Shtopko},
|
|
962
|
+
year={2024},
|
|
963
|
+
eprint={2406.12925},
|
|
964
|
+
archivePrefix={arXiv},
|
|
965
|
+
primaryClass={cs.LG},
|
|
966
|
+
url={https://arxiv.org/abs/2406.12925},
|
|
967
|
+
}
|
|
968
|
+
```
|
|
969
|
+
|
|
970
|
+
## Contributing
|
|
971
|
+
|
|
972
|
+
We welcome contributions! Areas of interest:
|
|
973
|
+
|
|
974
|
+
- **Database layers** (MongoDB, Neo4j, vector databases)
|
|
975
|
+
- **Performance optimizations**
|
|
976
|
+
- **Documentation improvements**
|
|
977
|
+
|
|
978
|
+
## License
|
|
979
|
+
|
|
980
|
+
Apache 2.0 License. See the [LICENSE](LICENSE) file for details.
|
|
981
|
+
|
|
982
|
+
## Acknowledgments
|
|
983
|
+
|
|
984
|
+
- **GLiNER** — Zero-shot NER and entity linking ([urchade/GLiNER](https://github.com/urchade/GLiNER))
|
|
985
|
+
- **spaCy** — Industrial-strength NLP ([explosion/spaCy](https://github.com/explosion/spaCy))
|
|
986
|
+
|
|
987
|
+
## Contact
|
|
988
|
+
|
|
989
|
+
- **GitHub**: [Knowledgator/GLinker](https://github.com/Knowledgator/GLinker)
|
|
990
|
+
- **Email**: info@knowledgator.com
|
|
991
|
+
|
|
992
|
+
---
|
|
993
|
+
|
|
994
|
+
Developed by [Knowledgator](https://knowledgator.com)
|