glinker 0.1.0__py3-none-any.whl

@@ -0,0 +1,994 @@
1
+ Metadata-Version: 2.4
2
+ Name: glinker
3
+ Version: 0.1.0
4
+ Summary: GLiNKER - A modular multi-layer entity linking framework
5
+ Author-email: Knowledgator <info@knowledgator.com>
6
+ License: Apache-2.0
7
+ Project-URL: Homepage, https://github.com/Knowledgator/GLinker
8
+ Project-URL: Repository, https://github.com/Knowledgator/GLinker
9
+ Project-URL: Documentation, https://github.com/Knowledgator/GLinker/blob/main/README.md
10
+ Keywords: entity-linking,nlp,gliner,spacy,ner
11
+ Classifier: Development Status :: 3 - Alpha
12
+ Classifier: Intended Audience :: Developers
13
+ Classifier: Intended Audience :: Science/Research
14
+ Classifier: License :: OSI Approved :: Apache Software License
15
+ Classifier: Programming Language :: Python :: 3
16
+ Classifier: Programming Language :: Python :: 3.10
17
+ Classifier: Programming Language :: Python :: 3.11
18
+ Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
19
+ Requires-Python: >=3.10
20
+ Description-Content-Type: text/markdown
21
+ License-File: LICENSE
22
+ Requires-Dist: spacy>=3.7.0
23
+ Requires-Dist: gliner>=0.2.0
24
+ Requires-Dist: torch>=2.0.0
25
+ Requires-Dist: redis>=5.0.0
26
+ Requires-Dist: elasticsearch>=8.11.0
27
+ Requires-Dist: psycopg2-binary>=2.9.0
28
+ Requires-Dist: pydantic>=2.0.0
29
+ Requires-Dist: pyyaml>=6.0.0
30
+ Requires-Dist: tqdm>=4.65.0
31
+ Provides-Extra: dev
32
+ Requires-Dist: pytest>=7.4.0; extra == "dev"
33
+ Requires-Dist: pytest-cov>=4.1.0; extra == "dev"
34
+ Requires-Dist: black>=23.0.0; extra == "dev"
35
+ Requires-Dist: ruff>=0.1.0; extra == "dev"
36
+ Provides-Extra: demo
37
+ Requires-Dist: gradio>=4.0.0; extra == "demo"
38
+ Provides-Extra: all
39
+ Requires-Dist: glinker[demo,dev]; extra == "all"
40
+ Dynamic: license-file
41
+
42
+ # GLiNKER - Entity Linking Framework
43
+
44
+ <div align="center">
45
+ <div>
46
+ <a href="https://arxiv.org/abs/2406.12925"><img src="https://img.shields.io/badge/arXiv-2406.12925-b31b1b.svg" alt="GLiNER-bi-Encoder"></a>
47
+ <a href="https://discord.gg/HbW9aNJ9"><img alt="Discord" src="https://img.shields.io/discord/1089800235347353640?logo=discord&logoColor=white&label=Discord&color=blue"></a>
48
+ <a href="https://github.com/Knowledgator/GLinker/blob/main/LICENSE"><img alt="License" src="https://img.shields.io/github/license/Knowledgator/GLinker?color=blue"></a>
49
+ <a href="https://hf.co/collections/knowledgator/gliner-linker"><img src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Models-yellow" alt="HuggingFace Models"></a>
50
+ <a href="https://www.apache.org/licenses/LICENSE-2.0"><img src="https://img.shields.io/badge/License-Apache%202.0-blue.svg" alt="License: Apache 2.0"></a>
51
+ <a href="https://pypi.org/project/glinker/"><img src="https://badge.fury.io/py/glinker.svg" alt="PyPI version"></a>
52
+ </div>
53
+ <br>
54
+ </div>
55
+
56
+ ![GLiNKER header](logo/header.png)
57
+
58
+ > A modular, production-ready entity linking framework combining NER, multi-layer database search, and neural entity disambiguation.
59
+
60
+ ## Overview
61
+
62
+ GLiNKER is a modular entity linking pipeline that transforms raw text into structured, disambiguated entity mentions. It's designed for:
63
+
64
+ - **Production use**: Multi-layer caching (Redis → Elasticsearch → PostgreSQL)
65
+ - **Research flexibility**: Fully configurable YAML pipelines
66
+ - **Performance**: Embedding precomputation for BiEncoder models
67
+ - **Scalability**: DAG-based execution with batch processing
68
+
69
+
70
+ GLiNKER is built around GLiNER — a family of lightweight, generalist models for information extraction. It brings several key advantages to the entity linking pipeline:
71
+
72
+ - **Zero-shot recognition** — Identify any entity type by simply providing label names. No fine-tuning or annotated data required. Switch from biomedical genes to legal entities by changing a list of strings.
73
+ - **Unified architecture** — A single model handles both NER (L1) and entity disambiguation (L3/L4), reducing deployment complexity and keeping the inference stack consistent.
74
+ - **Efficient BiEncoder support** — BiEncoder variants allow precomputing label embeddings once and reusing them across millions of documents, delivering 10–100× speedups for large-scale linking.
75
+ - **Compact and fast** — Base models are small enough to run on CPU, while larger variants scale with GPU for production throughput.
76
+ - **Open and extensible** — Apache 2.0 licensed models on Hugging Face, easy to swap for domain-specific fine-tunes when needed.
77
+
78
+
79
+ ### Traditional vs GLiNKER Approach
80
+
81
+ ```python
82
+ # Traditional approach: Complex, coupled code
83
+ ner_results = spacy_model(text)
84
+ candidates = search_database(ner_results)
85
+ linked = gliner_model.disambiguate(candidates)
86
+ # Mix of models, databases, and business logic
87
+
88
+ # GLiNKER approach: Declarative configuration
89
+ from glinker import ConfigBuilder, DAGExecutor
90
+
91
+ builder = ConfigBuilder(name="biomedical_el")
92
+ builder.l1.gliner(model="knowledgator/gliner-bi-base-v2.0", labels=["gene", "protein", "disease"])
93
+ builder.l2.add("redis", priority=2).add("postgres", priority=0)
94
+ builder.l3.configure(model="knowledgator/gliner-linker-large-v1.0")
95
+
96
+ executor = DAGExecutor(builder.get_config())
97
+ result = executor.execute({"texts": ["CRISPR-Cas9 enables precise gene therapy"]})
98
+ ```
99
+
100
+ ## Table of Contents
101
+
102
+ - [Quick Start](#quick-start)
103
+ - [Creating Pipelines](#creating-pipelines)
104
+ - [Option 1: create_simple (recommended start)](#option-1-create_simple-recommended-start)
105
+ - [Option 2: From a YAML config file](#option-2-from-a-yaml-config-file)
106
+ - [Option 3: ConfigBuilder (programmatic)](#option-3-configbuilder-programmatic)
107
+ - [Loading Entities](#loading-entities)
108
+ - [From a JSONL file](#from-a-jsonl-file)
109
+ - [From a Python list](#from-a-python-list)
110
+ - [From a Python dict](#from-a-python-dict)
111
+ - [Entity format reference](#entity-format-reference)
112
+ - [Architecture](#architecture)
113
+ - [Features](#features)
114
+ - [YAML Configuration Reference](#yaml-configuration-reference)
115
+ - [Advanced Features](#advanced-features)
116
+ - [Database Setup](#database-setup)
117
+ - [Testing](#testing)
118
+ - [Citations](#citations)
119
+
120
+ ## Quick Start
121
+
122
+ ### Installation
123
+
124
+ Install with pip:
125
+
126
+ ```bash
127
+ pip install glinker
128
+ ```
129
+
130
+ Or install from source:
131
+
132
+ ```bash
133
+ git clone https://github.com/Knowledgator/GLinker.git
134
+ cd GLinker
135
+ pip install -e .
136
+
137
+ # With optional dependencies
138
+ pip install -e ".[dev,demo]"
139
+ ```
140
+
141
+ ### 30-Second Example
142
+
143
+ ```python
144
+ from glinker import ConfigBuilder, DAGExecutor
145
+
146
+ # 1. Build configuration
147
+ builder = ConfigBuilder(name="demo")
148
+ builder.l1.spacy(model="en_core_web_sm")
149
+ builder.l3.configure(model="knowledgator/gliner-linker-large-v1.0")
150
+
151
+ # 2. Create executor
152
+ executor = DAGExecutor(builder.get_config())
153
+
154
+ # 3. Load entities
155
+ executor.load_entities("data/entities.jsonl", target_layers=["dict"])
156
+
157
+ # 4. Process text
158
+ result = executor.execute({
159
+ "texts": ["Farnese Palace is one of the most important palaces in the city of Rome."]
160
+ })
161
+
162
+ # 5. Get results
163
+ l0_result = result.get("l0_result")
164
+ for entity in l0_result.entities:
165
+ if entity.linked_entity:
166
+ print(f"{entity.mention_text} → {entity.linked_entity.label}")
167
+ print(f" Confidence: {entity.linked_entity.score:.3f}")
168
+ ```
169
+
170
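+ **Example output** (illustrative format only; these lines come from a biomedical input, not the Farnese Palace sentence above):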
171
+ ```
172
+ BRCA1 → BRCA1: Breast cancer type 1 susceptibility protein
173
+ Confidence: 0.923
174
+ breast cancer → Breast Cancer: Malignant neoplasm of the breast
175
+ Confidence: 0.887
176
+ ```
177
+
178
+ ---
179
+
180
+ ## Creating Pipelines
181
+
182
+ GLiNKER offers three ways to create a pipeline, from simplest to most configurable.
183
+
184
+ ### Option 1: `create_simple` (recommended start)
185
+
186
+ `ProcessorFactory.create_simple` builds an **L2 → L3 → L0** pipeline in one call. There is no NER step: the model links entities directly from the input text against all loaded entities.
187
+
188
+ ```python
189
+ from glinker import ProcessorFactory
190
+
191
+ # Minimal — just a model name
192
+ executor = ProcessorFactory.create_simple(
193
+ model_name="knowledgator/gliner-bi-base-v2.0",
194
+ threshold=0.5,
195
+ )
196
+
197
+ # Load entities and run
198
+ executor.load_entities("data/entities.jsonl")
199
+ result = executor.execute({"texts": ["CRISPR-Cas9 enables precise gene therapy."]})
200
+ ```
201
+
202
+ **With inline entities (no file needed):**
203
+
204
+ ```python
205
+ executor = ProcessorFactory.create_simple(
206
+ model_name="knowledgator/gliner-bi-base-v2.0",
207
+ threshold=0.5,
208
+ entities=[
209
+ {"entity_id": "Q101", "label": "insulin", "description": "Peptide hormone regulating blood glucose"},
210
+ {"entity_id": "Q102", "label": "glucose", "description": "Primary blood sugar and key metabolic fuel"},
211
+ {"entity_id": "Q103", "label": "GLUT4", "description": "Insulin-responsive glucose transporter in muscle and adipose tissue"},
212
+ {"entity_id": "Q104", "label": "pancreatic beta cell", "description": "Endocrine cell type that secretes insulin"},
213
+ ],
214
+ )
215
+
216
+ result = executor.execute({
217
+ "texts": [
218
+ "After a meal, pancreatic beta cells release insulin, which promotes GLUT4 translocation and increases glucose uptake in muscle."
219
+ ]
220
+ })
221
+ ```
222
+
223
+ **With a reranker (L2 → L3 → L4 → L0):**
224
+
225
+ ```python
226
+ executor = ProcessorFactory.create_simple(
227
+ model_name="knowledgator/gliner-bi-base-v2.0",
228
+ threshold=0.5,
229
+ reranker_model="knowledgator/gliner-multitask-large-v0.5",
230
+ reranker_max_labels=20,
231
+ reranker_threshold=0.3,
232
+ entities="data/entities.jsonl",
233
+ precompute_embeddings=True,
234
+ )
235
+ ```
236
+
237
+ **With entity descriptions in the template:**
238
+
239
+ ```python
240
+ executor = ProcessorFactory.create_simple(
241
+ model_name="knowledgator/gliner-bi-base-v2.0",
242
+ template="{label}: {description}", # L3 sees "BRCA1: Breast cancer type 1 susceptibility protein"
243
+ entities="data/entities.jsonl",
244
+ )
245
+ ```
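As a rough illustration of what the `template` parameter does, the sketch below renders an entity record through a format string using plain `str.format`. `render_label` is a hypothetical helper written for this example, not part of the GLiNKER API, and the real implementation may treat missing fields differently.

```python
# Illustrative sketch only: render_label is a hypothetical helper, not GLiNKER API.

def render_label(entity: dict, template: str = "{label}") -> str:
    """Fill a label template from an entity record using str.format."""
    # Assumption: missing optional fields render as empty strings.
    fields = {"label": "", "description": "", "entity_type": ""}
    fields.update({key: value for key, value in entity.items() if isinstance(value, str)})
    return template.format(**fields)


entity = {
    "entity_id": "Q101",
    "label": "insulin",
    "description": "Peptide hormone regulating blood glucose",
}

print(render_label(entity))                            # default template
print(render_label(entity, "{label}: {description}"))  # label plus description
```

A richer template gives the model more context per candidate at the cost of longer label strings, which matters when `max_length` is tight.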
246
+
247
+ **All `create_simple` parameters:**
248
+
249
+ | Parameter | Default | Description |
250
+ |-----------|---------|-------------|
251
+ | `model_name` | *(required)* | HuggingFace model ID or local path |
252
+ | `device` | `"cpu"` | Torch device (`"cpu"`, `"cuda"`, `"cuda:0"`) |
253
+ | `threshold` | `0.5` | Minimum score for entity predictions |
254
+ | `template` | `"{label}"` | Format string for entity labels (e.g. `"{label}: {description}"`) |
255
+ | `max_length` | `512` | Max sequence length for tokenization |
256
+ | `token` | `None` | HuggingFace auth token for gated models |
257
+ | `entities` | `None` | Entity data to load immediately (file path, list of dicts, or dict of dicts) |
258
+ | `precompute_embeddings` | `False` | Pre-embed all entity labels after loading (BiEncoder only) |
259
+ | `verbose` | `False` | Enable verbose logging |
260
+ | `reranker_model` | `None` | GLiNER model for L4 reranking (adds L4 node when set) |
261
+ | `reranker_max_labels` | `20` | Max candidate labels per L4 inference call |
262
+ | `reranker_threshold` | `None` | Score threshold for L4 (defaults to `threshold`) |
263
+
264
+ ### Option 2: From a YAML config file
265
+
266
+ For full control over every layer, define the pipeline in YAML and load it:
267
+
268
+ ```python
269
+ from glinker import ProcessorFactory
270
+
271
+ executor = ProcessorFactory.create_pipeline("configs/pipelines/dict/simple.yaml")
272
+ executor.load_entities("data/entities.jsonl")
273
+ result = executor.execute({"texts": ["TP53 mutations cause cancer"]})
274
+ ```
275
+
276
+ See [YAML Configuration Reference](#yaml-configuration-reference) for full config examples.
277
+
278
+ ### Option 3: `ConfigBuilder` (programmatic)
279
+
280
+ Build configs in Python with full control over each layer:
281
+
282
+ ```python
283
+ from glinker import ConfigBuilder, DAGExecutor
284
+
285
+ builder = ConfigBuilder(name="my_pipeline")
286
+ builder.l1.gliner(model="knowledgator/gliner-bi-base-v2.0", labels=["gene", "disease"])
287
+ builder.l3.configure(model="knowledgator/gliner-linker-large-v1.0")
288
+
289
+ executor = DAGExecutor(builder.get_config())
290
+ executor.load_entities("data/entities.jsonl", target_layers=["dict"])
291
+ ```
292
+
293
+ **With multiple database layers:**
294
+
295
+ ```python
296
+ builder = ConfigBuilder(name="production")
297
+ builder.l1.gliner(model="knowledgator/gliner-bi-base-v2.0", labels=["gene", "protein"])
298
+ builder.l2.add("redis", priority=2, ttl=3600)
299
+ builder.l2.add("elasticsearch", priority=1, ttl=86400)
300
+ builder.l2.add("postgres", priority=0)
301
+ builder.l3.configure(model="knowledgator/gliner-linker-large-v1.0", use_precomputed_embeddings=True)
302
+ builder.l0.configure(strict_matching=True, min_confidence=0.3)
303
+ builder.save("config.yaml")
304
+ ```
305
+
306
+ **With L4 reranker:**
307
+
308
+ ```python
309
+ builder = ConfigBuilder(name="reranked")
310
+ builder.l1.gliner(model="knowledgator/gliner-bi-base-v2.0", labels=["gene", "disease"])
311
+ builder.l3.configure(model="knowledgator/gliner-linker-base-v1.0")
312
+ builder.l4.configure(
313
+ model="knowledgator/gliner-multitask-large-v0.5",
314
+ threshold=0.3,
315
+ max_labels=20,
316
+ )
317
+ builder.save("config.yaml") # Generates L1 → L2 → L3 → L4 → L0
318
+ ```
319
+
320
+ ---
321
+
322
+ ## Loading Entities
323
+
324
+ Entities can be loaded after pipeline creation via `executor.load_entities()`, or passed directly to `create_simple(entities=...)`. Three input formats are supported.
325
+
326
+ ### From a JSONL file
327
+
328
+ One JSON object per line:
329
+
330
+ ```python
331
+ executor.load_entities("data/entities.jsonl")
332
+
333
+ # Or target specific database layers
334
+ executor.load_entities("data/entities.jsonl", target_layers=["dict", "postgres"])
335
+ ```
336
+
337
+ **`data/entities.jsonl`:**
338
+
339
+ ```jsonl
340
+ {"entity_id": "Q123", "label": "Kyiv", "description": "Capital and largest city of Ukraine", "entity_type": "city", "popularity": 1000000, "aliases": ["Kiev"]}
341
+ {"entity_id": "Q456", "label": "Dnipro River", "description": "Major river flowing through Ukraine and Belarus", "entity_type": "river", "popularity": 950000, "aliases": ["Dnieper"]}
342
+ {"entity_id": "Q789", "label": "Carpathian Mountains", "description": "Mountain range in Central and Eastern Europe", "entity_type": "mountain_range", "popularity": 800000, "aliases": ["Carpathians"]}
343
+ ```
344
+
345
+ ### From a Python list
346
+
347
+ ```python
348
+ entities = [
349
+ {
350
+ "entity_id": "Q123",
351
+ "label": "Kyiv",
352
+ "description": "Capital and largest city of Ukraine",
353
+ "entity_type": "city",
354
+ "aliases": ["Kiev"],
355
+ },
356
+ {
357
+ "entity_id": "Q456",
358
+ "label": "Dnipro River",
359
+ "description": "Major river flowing through Ukraine and Belarus",
360
+ "entity_type": "river",
361
+ "aliases": ["Dnieper"],
362
+ },
363
+ ]
364
+
365
+ executor.load_entities(entities)
366
+ ```
367
+
368
+ ### From a Python dict
369
+
370
+ Keys are entity IDs, values are entity data:
371
+
372
+ ```python
373
+ entities = {
374
+ "Q123": {
375
+ "label": "Kyiv",
376
+ "description": "Capital and largest city of Ukraine",
377
+ "entity_type": "city",
378
+ },
379
+ "Q456": {
380
+ "label": "Dnipro River",
381
+ "description": "Major river flowing through Ukraine and Belarus",
382
+ "entity_type": "river",
383
+ },
384
+ }
385
+
386
+ executor.load_entities(entities)
387
+ ```
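The three input shapes carry the same information, so one way to picture loading is as a normalization step into a single list of records. The helper below is a sketch under that assumption; `normalize_entities` is a hypothetical name and does not claim to mirror how `load_entities` is implemented.

```python
import json
from typing import Union


def normalize_entities(source: Union[str, list, dict]) -> list:
    """Coerce a JSONL path, a list of dicts, or a dict of dicts into a list of records."""
    if isinstance(source, dict):
        # Dict of dicts: the key supplies the entity ID.
        return [{"entity_id": entity_id, **data} for entity_id, data in source.items()]
    if isinstance(source, str):
        # String: treated as a path to a JSONL file, one JSON object per line.
        with open(source, encoding="utf-8") as handle:
            return [json.loads(line) for line in handle if line.strip()]
    # Anything else is assumed to be an iterable of entity dicts.
    return [dict(item) for item in source]


as_dict = {"Q123": {"label": "Kyiv", "entity_type": "city"}}
as_list = [{"entity_id": "Q123", "label": "Kyiv", "entity_type": "city"}]
print(normalize_entities(as_dict) == normalize_entities(as_list))
```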
388
+
389
+
390
+ ### Entity format reference
391
+
392
+ | Field | Type | Required | Default | Description |
393
+ |-------|------|----------|---------|-------------|
394
+ | `entity_id` | str | **yes** | — | Unique identifier |
395
+ | `label` | str | **yes** | — | Primary name |
396
+ | `description` | str | no | `""` | Text description (used in templates like `"{label}: {description}"`) |
397
+ | `entity_type` | str | no | `""` | Category (e.g. `"gene"`, `"disease"`) |
398
+ | `aliases` | list[str] | no | `[]` | Alternative names for search matching |
399
+ | `popularity` | int | no | `0` | Ranking score for candidate ordering |
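The table maps naturally onto a small record type with defaults. The dataclass below is a sketch of that mapping for validation in user code; `EntityRecord` and `parse_entity` are hypothetical names, and since the package lists pydantic as a dependency, its internal entity model may look different.

```python
from dataclasses import dataclass, field


@dataclass
class EntityRecord:
    # Required fields
    entity_id: str
    label: str
    # Optional fields with the defaults from the table above
    description: str = ""
    entity_type: str = ""
    aliases: list = field(default_factory=list)
    popularity: int = 0


def parse_entity(raw: dict) -> EntityRecord:
    """Build a record, rejecting input that lacks a required field."""
    missing = [name for name in ("entity_id", "label") if not raw.get(name)]
    if missing:
        raise ValueError(f"missing required field(s): {', '.join(missing)}")
    return EntityRecord(**raw)


record = parse_entity({"entity_id": "Q123", "label": "Kyiv", "aliases": ["Kiev"]})
print(record.description == "", record.popularity, record.aliases)
```

Checking records before loading them surfaces a missing `entity_id` or `label` early, rather than during retrieval.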
400
+
401
+ ---
402
+
403
+ ## Architecture
404
+
405
+ GLiNKER uses a **layered pipeline** with an optional reranking stage:
406
+
407
+ ![GLiNKER architecture diagram](logo/architecture.png)
408
+
409
+ | Layer | Purpose | Processor |
410
+ |-------|---------|-----------|
411
+ | **L1** | Mention extraction (spaCy or GLiNER NER) | `l1_spacy`, `l1_gliner` |
412
+ | **L2** | Candidate retrieval from database layers | `l2_chain` |
413
+ | **L3** | Entity disambiguation via GLiNER | `l3_batch` |
414
+ | **L4** | *(Optional)* GLiNER reranking with candidate chunking | `l4_reranker` |
415
+ | **L0** | Aggregation, filtering, and final output | `l0_aggregator` |
416
+
417
+ **Supported topologies:**
418
+ ```
419
+ Full pipeline: L1 → L2 → L3 → L0
420
+ With reranking: L1 → L2 → L3 → L4 → L0
421
+ Simple (no NER): L2 → L3 → L0
422
+ Simple + reranker: L2 → L4 → L0
423
+ ```
424
+
425
+ **Key Concepts:**
426
+
427
+ - **DAG Execution**: Layers execute in dependency order with automatic data flow
428
+ - **Component-Processor Pattern**: Each layer has a Component (methods) and Processor (orchestration)
429
+ - **Schema Consistency**: Single template (e.g., `"{label}: {description}"`) across layers
430
+ - **Cache Hierarchy**: Upper layers cache results from lower layers automatically
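Dependency-ordered execution can be pictured as a topological sort over each node's `requires` list. The sketch below uses the standard-library `graphlib` module on node IDs shaped like those in the YAML configs; it illustrates the ordering idea only and is not a description of how `DAGExecutor` is implemented.

```python
from graphlib import TopologicalSorter

# Node IDs and `requires` lists modeled on the L1 -> L2 -> L3 -> L4 -> L0 topology.
nodes = [
    {"id": "l0", "requires": ["l1", "l2", "l4"]},
    {"id": "l4", "requires": ["l1", "l2", "l3"]},
    {"id": "l3", "requires": ["l1", "l2"]},
    {"id": "l2", "requires": ["l1"]},
    {"id": "l1"},
]


def execution_order(nodes):
    """Return node IDs so that every node appears after the nodes it requires."""
    graph = {node["id"]: set(node.get("requires", [])) for node in nodes}
    return list(TopologicalSorter(graph).static_order())


print(execution_order(nodes))  # ['l1', 'l2', 'l3', 'l4', 'l0']
```

`TopologicalSorter` raises `graphlib.CycleError` on circular dependencies, which is the kind of check a config loader would want before running anything.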
431
+
432
+ ---
433
+
434
+ ## Features
435
+
436
+ ### Multiple NER Backends
437
+ - **spaCy** — Fast NER from pretrained statistical pipelines for standard use cases
438
+ - **GLiNER** — Neural NER with custom labels (no training required)
439
+
440
+ ### Multi-Layer Database Support
441
+ - **Dict** — In-memory (perfect for demos)
442
+ - **Redis** — Fast cache (production)
443
+ - **Elasticsearch** — Full-text search with fuzzy matching
444
+ - **PostgreSQL** — Persistent storage with pg_trgm fuzzy search
445
+
446
+ ### Performance Optimization
447
+ - **Embedding Precomputation** — Cache label embeddings for BiEncoder models
448
+ - **Cache Hierarchy** — Automatic write-back: Redis → ES → PostgreSQL
449
+ - **Batch Processing** — Efficient parallel processing
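One way to read the priority numbers and the write-back behaviour: query layers from highest priority to lowest, and on a hit in a slower layer copy the result into the faster layers that missed. The sketch below models that with in-memory dicts. `Layer` and `lookup` are hypothetical, and the search direction is an assumption drawn from the `priority` values shown in this README.

```python
class Layer:
    """Stand-in for a database layer; a plain dict plays the role of the store."""

    def __init__(self, name, priority, data=None):
        self.name = name
        self.priority = priority
        self.store = dict(data or {})


def lookup(layers, mention):
    """Return (candidates, layer_name), writing hits back to higher-priority layers."""
    missed = []
    for layer in sorted(layers, key=lambda item: item.priority, reverse=True):
        hit = layer.store.get(mention)
        if hit is not None:
            for upper in missed:  # write-back into the faster layers
                upper.store[mention] = hit
            return hit, layer.name
        missed.append(layer)
    return None, None


layers = [
    Layer("postgres", priority=0, data={"kyiv": ["Q123"]}),
    Layer("elasticsearch", priority=1),
    Layer("redis", priority=2),
]
print(lookup(layers, "kyiv"))  # first call falls through to postgres
print(lookup(layers, "kyiv"))  # second call is served by redis
```

Expiry (`ttl`) is left out of the sketch; a real cache layer would also need to drop entries once they age out.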
450
+
451
+ ### L4 Reranker (Optional)
452
+
453
+ When the candidate set from L2 is large (tens or hundreds of entities), a single GLiNER call may be impractical. The **L4 reranker** solves this by splitting candidates into chunks:
454
+
455
+ ```
456
+ 100 candidates, max_labels=20 → 5 GLiNER inference calls
457
+ Results merged, deduplicated, filtered by threshold
458
+ ```
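The chunking arithmetic above can be sketched in a few lines. In the example below, `score_chunk` is a hypothetical stand-in for one GLiNER inference call that returns `(label, score)` pairs. The merge rule (keep the best score per label) is an assumption, since the README only says results are merged, deduplicated, and filtered.

```python
def chunked(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def rerank(text, candidates, score_chunk, max_labels=20, threshold=0.3):
    """Score candidates chunk by chunk, then merge, deduplicate, and filter."""
    best = {}
    calls = 0
    for chunk in chunked(candidates, max_labels):
        calls += 1  # one model call per chunk
        for label, score in score_chunk(text, chunk):
            if score >= threshold and score > best.get(label, 0.0):
                best[label] = score
    ranked = sorted(best.items(), key=lambda pair: pair[1], reverse=True)
    return ranked, calls


def toy_scorer(text, labels):
    # Toy rule in place of a model: high score when the label occurs in the text.
    return [(label, 0.9 if label in text else 0.1) for label in labels]


candidates = [f"entity {i}" for i in range(99)] + ["insulin"]
ranked, calls = rerank("insulin lowers blood glucose", candidates, toy_scorer)
print(calls, ranked)  # 5 [('insulin', 0.9)]
```

Because each chunk is scored independently, scores from different chunks are only comparable to the extent the model's scores are calibrated across label sets.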
459
+
460
+ L4 uses a **uni-encoder GLiNER model** and can be placed after L3 (true reranking) or used directly after L2 (replacing L3):
461
+
462
+ ```python
463
+ # Via ConfigBuilder
464
+ builder.l4.configure(
465
+ model="knowledgator/gliner-multitask-large-v0.5",
466
+ threshold=0.3,
467
+ max_labels=20 # candidates per inference call
468
+ )
469
+
470
+ # Via create_simple
471
+ executor = ProcessorFactory.create_simple(
472
+ model_name="knowledgator/gliner-bi-base-v2.0",
473
+ reranker_model="knowledgator/gliner-multitask-large-v0.5",
474
+ reranker_max_labels=20,
475
+ )
476
+ ```
477
+
478
+ ---
479
+
480
+ ## YAML Configuration Reference
481
+
482
+ YAML configs give full control over every node in the pipeline. Load them with:
483
+
484
+ ```python
485
+ from glinker import ProcessorFactory
486
+
487
+ executor = ProcessorFactory.create_pipeline("path/to/config.yaml")
488
+ ```
489
+
490
+ ### Simple pipeline (L2 → L3 → L0, no NER)
491
+
492
+ Equivalent to `create_simple`. No L1 node — texts are passed directly to L2/L3:
493
+
494
+ ```yaml
495
+ name: "simple"
496
+ description: "Simple pipeline - L3 only with entity database"
497
+
498
+ nodes:
499
+ - id: "l2"
500
+ processor: "l2_chain"
501
+ inputs:
502
+ texts:
503
+ source: "$input"
504
+ fields: "texts"
505
+ output:
506
+ key: "l2_result"
507
+ schema:
508
+ template: "{label}"
509
+ config:
510
+ max_candidates: 30
511
+ min_popularity: 0
512
+ layers:
513
+ - type: "dict"
514
+ priority: 0
515
+ write: true
516
+ search_mode: ["exact"]
517
+
518
+ - id: "l3"
519
+ processor: "l3_batch"
520
+ requires: ["l2"]
521
+ inputs:
522
+ texts:
523
+ source: "$input"
524
+ fields: "texts"
525
+ candidates:
526
+ source: "l2_result"
527
+ fields: "candidates"
528
+ output:
529
+ key: "l3_result"
530
+ schema:
531
+ template: "{label}"
532
+ config:
533
+ model_name: "knowledgator/gliner-bi-base-v2.0"
534
+ device: "cpu"
535
+ threshold: 0.5
536
+ flat_ner: true
537
+ multi_label: false
538
+ use_precomputed_embeddings: true
539
+ cache_embeddings: false
540
+ max_length: 512
541
+
542
+ - id: "l0"
543
+ processor: "l0_aggregator"
544
+ requires: ["l2", "l3"]
545
+ inputs:
546
+ l2_candidates:
547
+ source: "l2_result"
548
+ fields: "candidates"
549
+ l3_entities:
550
+ source: "l3_result"
551
+ fields: "entities"
552
+ output:
553
+ key: "l0_result"
554
+ config:
555
+ strict_matching: false
556
+ min_confidence: 0.0
557
+ include_unlinked: true
558
+ position_tolerance: 2
559
+ ```
560
+
561
+ ### Full pipeline with spaCy NER (L1 → L2 → L3 → L0)
562
+
563
+ ```yaml
564
+ name: "dict_default"
565
+ description: "In-memory dict layer with spaCy NER"
566
+
567
+ nodes:
568
+ - id: "l1"
569
+ processor: "l1_spacy"
570
+ inputs:
571
+ texts:
572
+ source: "$input"
573
+ fields: "texts"
574
+ output:
575
+ key: "l1_result"
576
+ config:
577
+ model: "en_core_sci_sm"
578
+ device: "cpu"
579
+ batch_size: 1
580
+ min_entity_length: 2
581
+ include_noun_chunks: true
582
+
583
+ - id: "l2"
584
+ processor: "l2_chain"
585
+ requires: ["l1"]
586
+ inputs:
587
+ mentions:
588
+ source: "l1_result"
589
+ fields: "entities"
590
+ output:
591
+ key: "l2_result"
592
+ schema:
593
+ template: "{label}: {description}"
594
+ config:
595
+ max_candidates: 5
596
+ layers:
597
+ - type: "dict"
598
+ priority: 0
599
+ write: true
600
+ search_mode: ["exact", "fuzzy"]
601
+ fuzzy:
602
+ max_distance: 64
603
+ min_similarity: 0.6
604
+
605
+ - id: "l3"
606
+ processor: "l3_batch"
607
+ requires: ["l2"]
608
+ inputs:
609
+ texts:
610
+ source: "$input"
611
+ fields: "texts"
612
+ candidates:
613
+ source: "l2_result"
614
+ fields: "candidates"
615
+ output:
616
+ key: "l3_result"
617
+ schema:
618
+ template: "{label}: {description}"
619
+ config:
620
+ model_name: "knowledgator/gliner-linker-large-v1.0"
621
+ device: "cpu"
622
+ threshold: 0.5
623
+ flat_ner: true
624
+ multi_label: false
625
+ max_length: 512
626
+
627
+ - id: "l0"
628
+ processor: "l0_aggregator"
629
+ requires: ["l1", "l2", "l3"]
630
+ inputs:
631
+ l1_entities:
632
+ source: "l1_result"
633
+ fields: "entities"
634
+ l2_candidates:
635
+ source: "l2_result"
636
+ fields: "candidates"
637
+ l3_entities:
638
+ source: "l3_result"
639
+ fields: "entities"
640
+ output:
641
+ key: "l0_result"
642
+ config:
643
+ strict_matching: true
644
+ min_confidence: 0.0
645
+ include_unlinked: true
646
+ position_tolerance: 2
647
+ ```
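The `fuzzy` block pairs a search mode with a similarity cutoff. How the dict layer scores similarity is not stated here, so the sketch below substitutes `difflib.SequenceMatcher` from the standard library purely to show what a `min_similarity` threshold does. `fuzzy_candidates` is a hypothetical helper, and `max_distance` is not modeled.

```python
from difflib import SequenceMatcher


def fuzzy_candidates(mention, labels, min_similarity=0.6):
    """Keep labels whose similarity ratio to the mention meets the cutoff."""
    scored = []
    for label in labels:
        ratio = SequenceMatcher(None, mention.lower(), label.lower()).ratio()
        if ratio >= min_similarity:
            scored.append((label, round(ratio, 3)))
    # Most similar first.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


labels = ["Kyiv", "Dnipro River", "Carpathian Mountains"]
print(fuzzy_candidates("Kiev", labels))
```

Raising `min_similarity` trims spurious candidates before they reach L3, while lowering it helps with spelling variants at the cost of more work downstream.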
648
+
649
+ ### Pipeline with L4 reranker (L1 → L2 → L3 → L4 → L0)
650
+
651
+ Use when the candidate set is large. L4 splits candidates into chunks of `max_labels` and runs GLiNER inference on each chunk:
652
+
653
+ ```yaml
654
+ name: "dict_reranker"
655
+ description: "In-memory dict with L4 GLiNER reranking"
656
+
657
+ nodes:
658
+ - id: "l1"
659
+ processor: "l1_gliner"
660
+ inputs:
661
+ texts:
662
+ source: "$input"
663
+ fields: "texts"
664
+ output:
665
+ key: "l1_result"
666
+ config:
667
+ model: "knowledgator/gliner-bi-base-v2.0"
668
+ labels: ["gene", "drug", "disease", "person", "organization"]
669
+ device: "cpu"
670
+
671
+ - id: "l2"
672
+ processor: "l2_chain"
673
+ requires: ["l1"]
674
+ inputs:
675
+ mentions:
676
+ source: "l1_result"
677
+ fields: "entities"
678
+ output:
679
+ key: "l2_result"
680
+ schema:
681
+ template: "{label}: {description}"
682
+ config:
683
+ max_candidates: 100
684
+ layers:
685
+ - type: "dict"
686
+ priority: 0
687
+ write: true
688
+ search_mode: ["exact", "fuzzy"]
689
+
690
+ - id: "l3"
691
+ processor: "l3_batch"
692
+ requires: ["l1", "l2"]
693
+ inputs:
694
+ texts:
695
+ source: "$input"
696
+ fields: "texts"
697
+ candidates:
698
+ source: "l2_result"
699
+ fields: "candidates"
700
+ output:
701
+ key: "l3_result"
702
+ schema:
703
+ template: "{label}: {description}"
704
+ config:
705
+ model_name: "knowledgator/gliner-linker-base-v1.0"
706
+ device: "cpu"
707
+ threshold: 0.5
708
+ use_precomputed_embeddings: true
709
+
710
+ - id: "l4"
711
+ processor: "l4_reranker"
712
+ requires: ["l1", "l2", "l3"]
713
+ inputs:
714
+ texts:
715
+ source: "$input"
716
+ fields: "texts"
717
+ candidates:
718
+ source: "l2_result"
719
+ fields: "candidates"
720
+ l1_entities:
721
+ source: "l1_result"
722
+ fields: "entities"
723
+ output:
724
+ key: "l4_result"
725
+ schema:
726
+ template: "{label}: {description}"
727
+ config:
728
+ model_name: "knowledgator/gliner-multitask-large-v0.5"
729
+ device: "cpu"
730
+ threshold: 0.3
731
+ max_labels: 20 # candidates per inference call
732
+
733
+ - id: "l0"
734
+ processor: "l0_aggregator"
735
+ requires: ["l1", "l2", "l4"]
736
+ inputs:
737
+ l1_entities:
738
+ source: "l1_result"
739
+ fields: "entities"
740
+ l2_candidates:
741
+ source: "l2_result"
742
+ fields: "candidates"
743
+ l3_entities:
744
+ source: "l4_result" # L0 reads from L4 instead of L3
745
+ fields: "entities"
746
+ output:
747
+ key: "l0_result"
748
+ config:
749
+ strict_matching: true
750
+ min_confidence: 0.0
751
+ include_unlinked: true
752
+ ```
753
+
754
+ ### Simple pipeline with reranker only (L2 → L4 → L0, no L1/L3)
755
+
756
+ Skips both NER and L3 — L4 handles entity linking directly with chunked inference:
757
+
758
+ ```yaml
759
+ name: "simple_reranker"
760
+ description: "Simple pipeline with L4 reranker - no L1 or L3"
761
+
762
+ nodes:
763
+ - id: "l2"
764
+ processor: "l2_chain"
765
+ inputs:
766
+ texts:
767
+ source: "$input"
768
+ fields: "texts"
769
+ output:
770
+ key: "l2_result"
771
+ schema:
772
+ template: "{label}: {description}"
773
+ config:
774
+ max_candidates: 100
775
+ layers:
776
+ - type: "dict"
777
+ priority: 0
778
+ write: true
779
+ search_mode: ["exact"]
780
+
781
+ - id: "l4"
782
+ processor: "l4_reranker"
783
+ requires: ["l2"]
784
+ inputs:
785
+ texts:
786
+ source: "$input"
787
+ fields: "texts"
788
+ candidates:
789
+ source: "l2_result"
790
+ fields: "candidates"
791
+ output:
792
+ key: "l4_result"
793
+ schema:
794
+ template: "{label}: {description}"
795
+ config:
796
+ model_name: "knowledgator/gliner-multitask-large-v0.5"
797
+ device: "cpu"
798
+ threshold: 0.5
799
+ max_labels: 20
800
+
801
+ - id: "l0"
802
+ processor: "l0_aggregator"
803
+ requires: ["l2", "l4"]
804
+ inputs:
805
+ l2_candidates:
806
+ source: "l2_result"
807
+ fields: "candidates"
808
+ l3_entities:
809
+ source: "l4_result"
810
+ fields: "entities"
811
+ output:
812
+ key: "l0_result"
813
+ config:
814
+ strict_matching: false
815
+ min_confidence: 0.0
816
+ include_unlinked: true
817
+ ```
818
+
819
+ ### Production config with multiple database layers
820
+
821
+ ```yaml
822
+ name: "production_pipeline"
823
+
824
+ nodes:
825
+ - id: "l2"
826
+ processor: "l2_chain"
827
+ config:
828
+ layers:
829
+ - type: "redis"
830
+ priority: 2
831
+ ttl: 3600
832
+ - type: "elasticsearch"
833
+ priority: 1
834
+ ttl: 86400
835
+ - type: "postgres"
836
+ priority: 0
837
+ ```
838
+
839
+ ---
840
+
841
+ ## Use Cases
842
+
843
+ ### Biomedical Text Mining
844
+ ```python
845
+ builder.l1.gliner(
846
+ model="knowledgator/gliner-bi-base-v2.0",
847
+ labels=["gene", "protein", "disease", "drug", "chemical"]
848
+ )
849
+ ```
850
+
851
+ ### News Article Analysis
852
+ ```python
853
+ builder.l1.spacy(model="en_core_web_lg")
854
+ # Link to Wikidata/Wikipedia entities
855
+ ```
856
+
857
+ ### Clinical NLP
858
+ ```python
859
+ builder.l1.gliner(
860
+ model="knowledgator/gliner-bi-base-v2.0",
861
+ labels=["symptom", "diagnosis", "medication", "procedure"]
862
+ )
863
+ ```
864
+
865
+ ---
866
+
867
+ ## Advanced Features
868
+
869
+ ### Precomputed Embeddings (BiEncoder)
870
+
871
+ For BiEncoder models, precomputing label embeddings gives 10–100× speedups:
872
+
873
+ ```python
874
+ # Load entities, then precompute
875
+ executor.load_entities("data/entities.jsonl")
876
+ executor.precompute_embeddings(batch_size=64)
877
+
878
+ # Or do both in create_simple
879
+ executor = ProcessorFactory.create_simple(
880
+ model_name="knowledgator/gliner-bi-base-v2.0",
881
+ entities="data/entities.jsonl",
882
+ precompute_embeddings=True,
883
+ )
884
+ ```
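The saving comes from how often the label encoder runs. The toy comparison below counts encoder calls with and without precomputation. `CountingEncoder` is a hypothetical bag-of-letters encoder used only for counting, not a GLiNER model, and the call counts say nothing about the real speedup figures.

```python
class CountingEncoder:
    """Toy encoder: normalized letter counts, plus a counter of how often it ran."""

    def __init__(self):
        self.calls = 0

    def encode(self, text):
        self.calls += 1
        vector = [0.0] * 26
        for char in text.lower():
            if "a" <= char <= "z":
                vector[ord(char) - ord("a")] += 1.0
        norm = sum(value * value for value in vector) ** 0.5 or 1.0
        return [value / norm for value in vector]


def dot(left, right):
    return sum(a * b for a, b in zip(left, right))


def link_naive(encoder, texts, labels):
    results = []
    for text in texts:
        text_vec = encoder.encode(text)
        label_vecs = [encoder.encode(label) for label in labels]  # repeated per text
        scores = [dot(text_vec, vec) for vec in label_vecs]
        results.append(labels[scores.index(max(scores))])
    return results


def link_precomputed(encoder, texts, labels):
    label_vecs = [encoder.encode(label) for label in labels]  # encoded once
    results = []
    for text in texts:
        text_vec = encoder.encode(text)
        scores = [dot(text_vec, vec) for vec in label_vecs]
        results.append(labels[scores.index(max(scores))])
    return results


labels = ["insulin", "glucose", "GLUT4"]
texts = ["insulin release", "glucose uptake", "GLUT4 translocation", "blood glucose"]

naive, cached = CountingEncoder(), CountingEncoder()
assert link_naive(naive, texts, labels) == link_precomputed(cached, texts, labels)
print(naive.calls, cached.calls)  # 16 7
```

With T texts and L labels the naive loop encodes T × (L + 1) strings and the precomputed one L + T, so the gap widens as the entity set grows.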
885
+
886
+ ### On-the-Fly Embedding Caching
887
+
888
+ Instead of precomputing all embeddings upfront, cache them as they are computed during inference:
889
+
890
+ ```python
891
+ builder.l3.configure(
892
+ model="knowledgator/gliner-linker-large-v1.0",
893
+ cache_embeddings=True,
894
+ )
895
+ ```
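Lazy caching trades the upfront pass for a lookup on first use. A minimal sketch of the idea with a dict-backed memo follows; `CachingLabelEncoder` is a hypothetical wrapper, and how GLiNKER stores cached embeddings is not specified in this README.

```python
class CachingLabelEncoder:
    """Wrap an encode function so each distinct label is embedded only once."""

    def __init__(self, encode_fn):
        self._encode = encode_fn
        self._cache = {}
        self.misses = 0

    def __call__(self, label):
        if label not in self._cache:
            self.misses += 1  # first sighting: compute and store
            self._cache[label] = self._encode(label)
        return self._cache[label]


# A trivial stand-in for a real embedding function.
encoder = CachingLabelEncoder(lambda label: [float(len(label))])

for label in ["gene", "disease", "gene", "gene", "disease"]:
    encoder(label)

print(encoder.misses)  # 2
```

This suits workloads where only a small part of a large entity set is ever retrieved, since labels that never appear as candidates are never embedded.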
896
+
897
+ ### Custom Pipelines
898
+
899
+ ```python
900
+ # Custom L1 processing pipeline (processor_registry must be imported first; its import path is not shown in this README)
901
+ l1_processor = processor_registry.get("l1_spacy")(
902
+ config_dict={"model": "en_core_sci_sm"},
903
+ pipeline=[
904
+ ("extract_entities", {}),
905
+ ("filter_by_length", {"min_length": 3}),
906
+ ("deduplicate", {}),
907
+ ("sort_by_position", {})
908
+ ]
909
+ )
910
+ ```
911
+
912
+ ---
913
+
914
+ ## Database Setup
915
+
916
+ ### Quick Start (Docker)
917
+ ```bash
918
+ # Start all databases
919
+ cd scripts/database
920
+ docker-compose up -d
921
+
922
+ # Load entities
923
+ bash setup_all.sh  # shell script; the cd above already entered scripts/database
924
+ ```
925
+
926
+ ### Manual Setup
927
+ ```python
928
+ from glinker import DAGExecutor
929
+
930
+ executor = DAGExecutor(pipeline)  # `pipeline` is a pipeline config, e.g. builder.get_config()
931
+ executor.load_entities(
932
+ filepath="data/entities.jsonl",
933
+ target_layers=["redis", "elasticsearch", "postgres"],
934
+ batch_size=1000
935
+ )
936
+ ```
937
+
938
+ ---
939
+
940
+ ## Testing
941
+
942
+ ```bash
943
+ # Run all tests
944
+ pytest
945
+
946
+ # Run specific layer tests
947
+ pytest tests/l1/
948
+ pytest tests/l2/
949
+
950
+ # Run with coverage
951
+ pytest --cov=glinker --cov-report=html
952
+ ```
953
+
954
+ ## Citations
955
+
956
+ If you find GLiNKER useful in your research, please consider citing our paper:
957
+
958
+ ```bibtex
959
+ @misc{stepanov2024glinermultitaskgeneralistlightweight,
960
+ title={GLiNER multi-task: Generalist Lightweight Model for Various Information Extraction Tasks},
961
+ author={Ihor Stepanov and Mykhailo Shtopko},
962
+ year={2024},
963
+ eprint={2406.12925},
964
+ archivePrefix={arXiv},
965
+ primaryClass={cs.LG},
966
+ url={https://arxiv.org/abs/2406.12925},
967
+ }
968
+ ```
969
+
970
+ ## Contributing
971
+
972
+ We welcome contributions! Areas of interest:
973
+
974
+ - **Database layers** (MongoDB, Neo4j, vector databases)
975
+ - **Performance optimizations**
976
+ - **Documentation improvements**
977
+
978
+ ## License
979
+
980
+ Licensed under Apache 2.0; see the [LICENSE](LICENSE) file for details.
981
+
982
+ ## Acknowledgments
983
+
984
+ - **GLiNER** — Zero-shot NER and entity linking ([urchade/GLiNER](https://github.com/urchade/GLiNER))
985
+ - **spaCy** — Industrial-strength NLP ([explosion/spaCy](https://github.com/explosion/spaCy))
986
+
987
+ ## Contact
988
+
989
+ - **GitHub**: [Knowledgator/GLinker](https://github.com/Knowledgator/GLinker)
990
+ - **Email**: info@knowledgator.com
991
+
992
+ ---
993
+
994
+ Developed by [Knowledgator](https://knowledgator.com)