chebilp 1.0.0__tar.gz
- chebilp-1.0.0/LICENSE +21 -0
- chebilp-1.0.0/PKG-INFO +206 -0
- chebilp-1.0.0/README.md +183 -0
- chebilp-1.0.0/chebILP/__init__.py +0 -0
- chebilp-1.0.0/chebILP/__main__.py +3 -0
- chebilp-1.0.0/chebILP/cli.py +607 -0
- chebilp-1.0.0/chebILP/clingo_eval.py +106 -0
- chebilp-1.0.0/chebILP/data_preparation.py +213 -0
- chebilp-1.0.0/chebILP/enhance_with_llms.py +250 -0
- chebilp-1.0.0/chebILP/ensemble_eval.py +606 -0
- chebilp-1.0.0/chebILP/explain.py +295 -0
- chebilp-1.0.0/chebILP/fg_matching.py +121 -0
- chebilp-1.0.0/chebILP/ilp_classifier.py +177 -0
- chebilp-1.0.0/chebILP/ilp_path_manager.py +41 -0
- chebilp-1.0.0/chebILP/ilp_problem_builder.py +394 -0
- chebilp-1.0.0/chebILP/learn_fgs.py +58 -0
- chebilp-1.0.0/chebILP/mol_to_fol.py +110 -0
- chebilp-1.0.0/chebILP/prepare_dl_preds.py +106 -0
- chebilp-1.0.0/chebILP/rule_to_nl.py +358 -0
- chebilp-1.0.0/chebILP/select_predicates.py +326 -0
- chebilp-1.0.0/chebILP/test.py +152 -0
- chebilp-1.0.0/chebILP/utils.py +69 -0
- chebilp-1.0.0/chebilp.egg-info/PKG-INFO +206 -0
- chebilp-1.0.0/chebilp.egg-info/SOURCES.txt +28 -0
- chebilp-1.0.0/chebilp.egg-info/dependency_links.txt +1 -0
- chebilp-1.0.0/chebilp.egg-info/entry_points.txt +2 -0
- chebilp-1.0.0/chebilp.egg-info/requires.txt +16 -0
- chebilp-1.0.0/chebilp.egg-info/top_level.txt +1 -0
- chebilp-1.0.0/pyproject.toml +32 -0
- chebilp-1.0.0/setup.cfg +4 -0
chebilp-1.0.0/LICENSE
ADDED
@@ -0,0 +1,21 @@
+MIT License
+
+Copyright (c) 2026 ChEB-AI
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.
chebilp-1.0.0/PKG-INFO
ADDED
@@ -0,0 +1,206 @@
+Metadata-Version: 2.4
+Name: chebilp
+Version: 1.0.0
+Summary: An Inductive Logic Programming framework for classifying chemical compounds into ChEBI classes.
+Requires-Python: >=3.10
+Description-Content-Type: text/markdown
+License-File: LICENSE
+Requires-Dist: chebi-utils>=0.2.1
+Requires-Dist: clingo>=5.8.0
+Requires-Dist: networkx>=3.6.1
+Requires-Dist: numpy>=2.4.3
+Requires-Dist: pandas>=3.0.1
+Requires-Dist: rdkit>=2025.9.6
+Requires-Dist: tqdm>=4.67.3
+Provides-Extra: explain
+Requires-Dist: xclingo>=2.0b14; extra == "explain"
+Requires-Dist: Pillow>=12.1.1; extra == "explain"
+Provides-Extra: llm
+Requires-Dist: anthropic>=0.104.1; extra == "llm"
+Requires-Dist: langsmith>=0.8.5; extra == "llm"
+Requires-Dist: python-dotenv>=1.2.2; extra == "llm"
+Dynamic: license-file
+
+# chebILP
+
+An Inductive Logic Programming (ILP) framework for classifying chemical compounds into [ChEBI](https://www.ebi.ac.uk/chebi/) classes. Rules are learned with [Popper](https://github.com/logic-and-learning-lab/Popper) and evaluated with [Clingo](https://potassco.org/clingo/) (Answer Set Programming).
+
+---
+
+## Installation
+
+### Prerequisites
+
+[SWI-Prolog](https://www.swi-prolog.org/Download.html) must be installed and on `PATH` (required by Popper).
+Popper must be installed as well. You can either install the [latest version of Popper](https://github.com/logic-and-learning-lab/Popper) with
+```
+pip install git+https://github.com/logic-and-learning-lab/Popper
+```
+or a forked, slightly outdated version with
+```
+pip install git+https://github.com/sfluegel05/Popper
+```
+With the latter, you can use the `--mdl_weight_fn`, `--mdl_weight_fp` and `--mdl_weight_seize` options of the learn command.
+
+### Core package
+
+```bash
+pip install chebILP
+```
+
+Extras:
+- `pip install chebILP[explain]` adds `xclingo` and `Pillow` for the `explain` command
+- `pip install chebILP[llm]` adds `anthropic`, `langsmith`, and `python-dotenv` for LLM-enhanced rule learning (`enhance_with_llms`, experimental)
+
+
+The `prepare_dl_preds` utility (one-time DL tensor extraction) additionally requires `torch`, which must be installed separately in an environment that has the DL model checkpoint.
+
+## Usage
+To get a list of available commands, run
+```bash
+python -m chebILP -h
+```
+To get help for a specific command, run
+```bash
+python -m chebILP {command} -h
+```
+
+## Workflows
+
+### 1. Generating new data
+
+An ILP dataset for ChEBI version 248 is available on [HuggingFace](https://huggingface.co/datasets/chebai/ChEBI25-3STAR-ILP). However, you can also create your own dataset.
+
+**Step 1 — Download ChEBI data and build the dataset** (downloads `chebi.obo` and `chebi.sdf.gz`, builds cached graph and molecule files, selects label classes, and creates a train/val/test split):
+```bash
+python -m chebILP prepare_dataset \
+--chebi_version 248 \
+--min_pos_samples 25
+```
+
+This writes to `data/chebi_v248/`:
+- `chebi_graph.pkl` — hierarchy graph (networkx DiGraph)
+- `molecules.pkl` — molecule DataFrame (index = ChEBI ID)
+- `min50/labels.txt` — selected class IDs (one per line)
+- `min50/splits.csv` — molecule-level train/val/test split
+
+**Step 2 — Build ILP example files** (positive/negative molecules per class):
+```bash
+python -m chebILP build_samples \
+--labels_file data/chebi_v248/ChEBI25_3_STAR/labels.txt \
+--chebi_split data/chebi_v248/ChEBI25_3_STAR/splits.csv \
+--chebi_graph_path data/chebi_v248/chebi_graph.pkl \
+--molecules_path data/chebi_v248/ChEBI25_3_STAR/molecules.pkl
+```
+
+**Step 3 — Build ILP background knowledge files** (molecule features as logic facts):
+```bash
+python -m chebILP build_bk \
+--labels_file data/chebi_v248/ChEBI25_3_STAR/labels.txt \
+--chebi_split data/chebi_v248/ChEBI25_3_STAR/splits.csv \
+--chebi_graph_path data/chebi_v248/chebi_graph.pkl \
+--molecules_path data/chebi_v248/ChEBI25_3_STAR/molecules.pkl
+```
+
+Steps 2 and 3 write files into `data/ilp_problems/` (one subdirectory per class). Available predicate sets: `atoms`, `chembl_fgs`, `chebi_fgs`, `chebi_fg_rules` and `chebi_fg_learned_rules`.
+
+---
+
+### 2. Learning ILP rules
+
+Learn Prolog classification rules for each class using the examples and background knowledge from workflow 1.
+The learn function will create an updated bias file based on the `max_vars`, `max_body` and `max_clauses` parameters.
+
+**Learn rules:**
+```bash
+python -m chebILP learn \
+--labels_file data/chebi_v248/ChEBI25_3_STAR/labels.txt \
+--chebi_split data/chebi_v248/ChEBI25_3_STAR/splits.csv \
+--timeout 60
+```
+
+Output is written to a timestamped directory `data/results/run_YYYYMMDD_HHMMSS/` containing `results.json` (one entry per class with the learned program and training score) and `config.yml`.
+
+**Evaluate on test/validation set:**
+```bash
+python -m chebILP test \
+--run_to_evaluate data/results/run_20260101_120000 \
+--test_on test
+```
+
+**Optional: LLM-enhanced rules**
+
+To improve learned programs with an LLM (requires `ANTHROPIC_API_KEY` in `.env`):
+```bash
+python -m chebILP.enhance_with_llms \
+--input data/enhance_with_llms/best_ilp_programs_for_leaves.csv \
+--output data/enhance_with_llms/enhanced_run \
+--chebi_version 248
+```
+
+Input CSV must have columns `chebi_id`, `program`, `run_name`. The output directory is readable by the `test` command.
+
+---
+
+### 3. Building an ensemble (ILP + DL)
+
+Combine ILP rules with a deep learning (DL) model for hierarchical multi-label classification. The ensemble uses DL predictions for non-leaf classes and selects either ILP or DL for each leaf class based on validation F1.
+
+**Step 1 — Build full ILP prediction tensors** (run once per ILP run, for the validation and/or test split):
+```bash
+python -m chebILP build_ilp_preds_for_ensemble \
+--run_dir data/results_val/run_20260101_120000 \
+--predict_on validation \
+--chebi_split data/chebi_v248/ChEBI25_3_STAR/processed/splits.csv \
+--chebi_version 248
+```
+
+This writes `full_val_preds.npy` and `full_val_preds_metadata.json` into the run directory. Repeat with `--predict_on test` for the test split.
+
+**Step 2 — Model selection and ILP tensor assembly:**
+```bash
+python -m chebILP ensemble_construct \
+--chebi_split data/chebi_v248/ChEBI25_3_STAR/processed/splits.csv \
+--dl_val_preds_npy data/preds/val_preds.npy \
+--dl_val_preds_meta data/preds/val_preds_metadata.json \
+--ilp_val_runs data/results_val/run_A data/results_val/run_B \
+--label_stats data/chebi_v248/ChEBI25_3_STAR/processed/class_stats.csv \
+--predict_on test \
+--output data/ensemble_predictions/ensemble
+```
+
+For each leaf class, selects the ILP run whose ensemble F1 (ILP prediction AND all DL parent predictions >= 0.5) is highest; falls back to DL if no ILP run beats it. Outputs:
+- `ensemble_trusted_models.csv` — which model is used per class
+- `ensemble_ilp_preds.npy` + `ensemble_ilp_preds_metadata.json` — ILP tensor for the target split
+
+**Step 3 — Aggregate into final predictions:**
+```bash
+python -m chebILP ensemble_aggregate \
+--dl_preds_npy data/preds/test_preds.npy \
+--dl_preds_meta data/preds/test_preds_metadata.json \
+--ilp_preds_npy data/ensemble_predictions/ensemble_ilp_preds.npy \
+--ilp_preds_meta data/ensemble_predictions/ensemble_ilp_preds_metadata.json \
+--trusted_models data/ensemble_predictions/ensemble_trusted_models.csv \
+--label_stats data/chebi_v248/ChEBI25_3_STAR/processed/class_stats.csv \
+--output data/ensemble_predictions/final_predictions.npy
+```
+
+DL predictions propagate freely through the class hierarchy; ILP and always-positive classes only predict a class if all label-set parents are already predicted positive. Output is a boolean NumPy array with a matching `_metadata.json`.
+
+---
+
+## Other utilities
+
+**Translate a rule to natural language:**
+```bash
+python -m chebILP rule_to_nl --rule_file my_rule.pl --class_parents data/class_parents.json
+```
+
+**Explain why a molecule satisfies a rule:**
+```bash
+python -m chebILP explain \
+--smiles "CCO" \
+--rule_file my_rule.pl \
+--label_parents_json data/class_parents.json \
+--output explanation.png
+```
chebilp-1.0.0/README.md
ADDED
@@ -0,0 +1,183 @@
+# chebILP
+
+An Inductive Logic Programming (ILP) framework for classifying chemical compounds into [ChEBI](https://www.ebi.ac.uk/chebi/) classes. Rules are learned with [Popper](https://github.com/logic-and-learning-lab/Popper) and evaluated with [Clingo](https://potassco.org/clingo/) (Answer Set Programming).
+
+---
+
+## Installation
+
+### Prerequisites
+
+[SWI-Prolog](https://www.swi-prolog.org/Download.html) must be installed and on `PATH` (required by Popper).
+Popper must be installed as well. You can either install the [latest version of Popper](https://github.com/logic-and-learning-lab/Popper) with
+```
+pip install git+https://github.com/logic-and-learning-lab/Popper
+```
+or a forked, slightly outdated version with
+```
+pip install git+https://github.com/sfluegel05/Popper
+```
+With the latter, you can use the `--mdl_weight_fn`, `--mdl_weight_fp` and `--mdl_weight_seize` options of the learn command.
+
+### Core package
+
+```bash
+pip install chebILP
+```
+
+Extras:
+- `pip install chebILP[explain]` adds `xclingo` and `Pillow` for the `explain` command
+- `pip install chebILP[llm]` adds `anthropic`, `langsmith`, and `python-dotenv` for LLM-enhanced rule learning (`enhance_with_llms`, experimental)
+
+
+The `prepare_dl_preds` utility (one-time DL tensor extraction) additionally requires `torch`, which must be installed separately in an environment that has the DL model checkpoint.
+
+## Usage
+To get a list of available commands, run
+```bash
+python -m chebILP -h
+```
+To get help for a specific command, run
+```bash
+python -m chebILP {command} -h
+```
+
+## Workflows
+
+### 1. Generating new data
+
+An ILP dataset for ChEBI version 248 is available on [HuggingFace](https://huggingface.co/datasets/chebai/ChEBI25-3STAR-ILP). However, you can also create your own dataset.
+
+**Step 1 — Download ChEBI data and build the dataset** (downloads `chebi.obo` and `chebi.sdf.gz`, builds cached graph and molecule files, selects label classes, and creates a train/val/test split):
+```bash
+python -m chebILP prepare_dataset \
+--chebi_version 248 \
+--min_pos_samples 25
+```
+
+This writes to `data/chebi_v248/`:
+- `chebi_graph.pkl` — hierarchy graph (networkx DiGraph)
+- `molecules.pkl` — molecule DataFrame (index = ChEBI ID)
+- `min50/labels.txt` — selected class IDs (one per line)
+- `min50/splits.csv` — molecule-level train/val/test split
+
+**Step 2 — Build ILP example files** (positive/negative molecules per class):
+```bash
+python -m chebILP build_samples \
+--labels_file data/chebi_v248/ChEBI25_3_STAR/labels.txt \
+--chebi_split data/chebi_v248/ChEBI25_3_STAR/splits.csv \
+--chebi_graph_path data/chebi_v248/chebi_graph.pkl \
+--molecules_path data/chebi_v248/ChEBI25_3_STAR/molecules.pkl
+```
+
+**Step 3 — Build ILP background knowledge files** (molecule features as logic facts):
+```bash
+python -m chebILP build_bk \
+--labels_file data/chebi_v248/ChEBI25_3_STAR/labels.txt \
+--chebi_split data/chebi_v248/ChEBI25_3_STAR/splits.csv \
+--chebi_graph_path data/chebi_v248/chebi_graph.pkl \
+--molecules_path data/chebi_v248/ChEBI25_3_STAR/molecules.pkl
+```
+
+Steps 2 and 3 write files into `data/ilp_problems/` (one subdirectory per class). Available predicate sets: `atoms`, `chembl_fgs`, `chebi_fgs`, `chebi_fg_rules` and `chebi_fg_learned_rules`.
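To make "examples" and "background knowledge" concrete, here is a minimal sketch of one class directory rendered in the `pos(...)`/`neg(...)` fact style that Popper reads. The predicate names (`target`, `has_atom`), the molecule identifiers, the class directory name, and the file names `exs.pl` and `bk.pl` are assumptions for illustration only; the files that `build_samples` and `build_bk` actually write may be named and structured differently.

```python
import tempfile
from pathlib import Path


def render_examples(positives, negatives, head="target"):
    """Render molecules as Popper-style example facts (hypothetical head predicate)."""
    lines = [f"pos({head}({mol}))." for mol in positives]
    lines += [f"neg({head}({mol}))." for mol in negatives]
    return "\n".join(lines) + "\n"


def render_background(atom_facts):
    """Render (molecule, element) pairs as ground facts; `has_atom` is a made-up predicate."""
    return "".join(f"has_atom({mol},{element}).\n" for mol, element in atom_facts)


def write_problem(class_dir, positives, negatives, atom_facts):
    """Write one problem directory containing an examples file and a background file."""
    class_dir = Path(class_dir)
    class_dir.mkdir(parents=True, exist_ok=True)
    (class_dir / "exs.pl").write_text(render_examples(positives, negatives))
    (class_dir / "bk.pl").write_text(render_background(atom_facts))
    return class_dir


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        out = write_problem(
            Path(tmp) / "ilp_problems" / "class_12345",  # hypothetical class directory
            positives=["m1", "m2"],
            negatives=["m3"],
            atom_facts=[("m1", "o"), ("m2", "o"), ("m3", "n")],
        )
        print((out / "exs.pl").read_text())
        print((out / "bk.pl").read_text())
```

Keeping one directory per class means each learning task can be run, retried, or timed out independently of the others.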
+
+---
+
+### 2. Learning ILP rules
+
+Learn Prolog classification rules for each class using the examples and background knowledge from workflow 1.
+The learn function will create an updated bias file based on the `max_vars`, `max_body` and `max_clauses` parameters.
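For orientation, a Popper bias file is a small Prolog file that bounds the hypothesis space. The sketch below shows how the three parameters named above could map onto Popper's `max_vars/1`, `max_body/1` and `max_clauses/1` directives. The helper name `render_bias`, the default values, and the predicates `target` and `has_atom` are hypothetical; the bias file chebILP generates may contain further declarations.

```python
def render_bias(head_pred, body_preds, max_vars=6, max_body=6, max_clauses=1):
    """Return the text of a Popper-style bias file.

    head_pred:  (name, arity) of the predicate to learn
    body_preds: iterable of (name, arity) pairs allowed in rule bodies
    """
    lines = [
        f"max_vars({max_vars}).",
        f"max_body({max_body}).",
        f"max_clauses({max_clauses}).",
        f"head_pred({head_pred[0]},{head_pred[1]}).",
    ]
    lines += [f"body_pred({name},{arity})." for name, arity in body_preds]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Illustrative values only; these are not the package defaults.
    print(render_bias(("target", 1), [("has_atom", 2)], max_vars=4, max_body=3, max_clauses=2))
```

Larger limits widen the search space, so they generally interact with the `--timeout` given to the learn command.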
+
+**Learn rules:**
+```bash
+python -m chebILP learn \
+--labels_file data/chebi_v248/ChEBI25_3_STAR/labels.txt \
+--chebi_split data/chebi_v248/ChEBI25_3_STAR/splits.csv \
+--timeout 60
+```
+
+Output is written to a timestamped directory `data/results/run_YYYYMMDD_HHMMSS/` containing `results.json` (one entry per class with the learned program and training score) and `config.yml`.
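Because run directories follow the `run_YYYYMMDD_HHMMSS` pattern, the most recent run can be located by parsing the directory names. The following is a minimal sketch using only the standard library; the helper names `run_dir_name` and `latest_run` are hypothetical and not part of chebILP.

```python
import tempfile
from datetime import datetime
from pathlib import Path

RUN_FORMAT = "run_%Y%m%d_%H%M%S"  # mirrors the run_YYYYMMDD_HHMMSS pattern


def run_dir_name(now):
    """Format a datetime as a run directory name."""
    return now.strftime(RUN_FORMAT)


def latest_run(results_dir):
    """Return the newest run directory under results_dir, or None if there is none."""
    runs = []
    for path in Path(results_dir).iterdir():
        if not path.is_dir():
            continue
        try:
            stamp = datetime.strptime(path.name, RUN_FORMAT)
        except ValueError:
            continue  # ignore directories that do not follow the naming pattern
        runs.append((stamp, path))
    return max(runs, key=lambda item: item[0])[1] if runs else None


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        for name in ("run_20260101_120000", "run_20260102_080000", "notes"):
            (Path(tmp) / name).mkdir()
        print(latest_run(tmp).name)
```

The resulting path can then be passed to `--run_to_evaluate` of the `test` command.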
+
+**Evaluate on test/validation set:**
+```bash
+python -m chebILP test \
+--run_to_evaluate data/results/run_20260101_120000 \
+--test_on test
+```
+
+**Optional: LLM-enhanced rules**
+
+To improve learned programs with an LLM (requires `ANTHROPIC_API_KEY` in `.env`):
+```bash
+python -m chebILP.enhance_with_llms \
+--input data/enhance_with_llms/best_ilp_programs_for_leaves.csv \
+--output data/enhance_with_llms/enhanced_run \
+--chebi_version 248
+```
+
+Input CSV must have columns `chebi_id`, `program`, `run_name`. The output directory is readable by the `test` command.
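Since the input CSV has a fixed set of required columns, it can be worth checking the header before starting a potentially costly LLM run. This is a minimal sketch with the standard `csv` module; the function name `read_programs` and the sample row are made up for illustration, and the actual script may validate its input differently.

```python
import csv
import io

REQUIRED_COLUMNS = {"chebi_id", "program", "run_name"}


def read_programs(handle):
    """Read the input CSV and fail early if a required column is missing."""
    reader = csv.DictReader(handle)
    missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    return list(reader)


if __name__ == "__main__":
    # Prolog programs contain commas, so the field has to be quoted in the CSV.
    sample = io.StringIO(
        "chebi_id,program,run_name\n"
        '12345,"target(A):- has_atom(A,o).",run_A\n'
    )
    for row in read_programs(sample):
        print(row["chebi_id"], row["run_name"], row["program"])
```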
+
+---
+
+### 3. Building an ensemble (ILP + DL)
+
+Combine ILP rules with a deep learning (DL) model for hierarchical multi-label classification. The ensemble uses DL predictions for non-leaf classes and selects either ILP or DL for each leaf class based on validation F1.
+
+**Step 1 — Build full ILP prediction tensors** (run once per ILP run, for the validation and/or test split):
+```bash
+python -m chebILP build_ilp_preds_for_ensemble \
+--run_dir data/results_val/run_20260101_120000 \
+--predict_on validation \
+--chebi_split data/chebi_v248/ChEBI25_3_STAR/processed/splits.csv \
+--chebi_version 248
+```
+
+This writes `full_val_preds.npy` and `full_val_preds_metadata.json` into the run directory. Repeat with `--predict_on test` for the test split.
+
+**Step 2 — Model selection and ILP tensor assembly:**
+```bash
+python -m chebILP ensemble_construct \
+--chebi_split data/chebi_v248/ChEBI25_3_STAR/processed/splits.csv \
+--dl_val_preds_npy data/preds/val_preds.npy \
+--dl_val_preds_meta data/preds/val_preds_metadata.json \
+--ilp_val_runs data/results_val/run_A data/results_val/run_B \
+--label_stats data/chebi_v248/ChEBI25_3_STAR/processed/class_stats.csv \
+--predict_on test \
+--output data/ensemble_predictions/ensemble
+```
+
+For each leaf class, selects the ILP run whose ensemble F1 (ILP prediction AND all DL parent predictions >= 0.5) is highest; falls back to DL if no ILP run beats it. Outputs:
+- `ensemble_trusted_models.csv` — which model is used per class
+- `ensemble_ilp_preds.npy` + `ensemble_ilp_preds_metadata.json` — ILP tensor for the target split
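The selection rule for a single leaf class can be sketched in plain Python as follows. This is one reading of the description above and not the package's implementation: the function names, the data layout (one list entry per validation molecule), and the label `"dl"` for the fallback are all assumptions.

```python
def f1_score(y_true, y_pred):
    """Binary F1 over two equally long lists of booleans."""
    tp = sum(t and p for t, p in zip(y_true, y_pred))
    fp = sum((not t) and p for t, p in zip(y_true, y_pred))
    fn = sum(t and (not p) for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def gated_ilp_preds(ilp_pred, dl_parent_scores, threshold=0.5):
    """Keep an ILP positive only if every DL parent score reaches the threshold."""
    return [
        pred and all(score >= threshold for score in parents)
        for pred, parents in zip(ilp_pred, dl_parent_scores)
    ]


def select_model(y_true, dl_scores, dl_parent_scores, ilp_runs, threshold=0.5):
    """Return (model_name, validation_f1); DL is kept unless an ILP run beats it."""
    best_name = "dl"
    best_f1 = f1_score(y_true, [score >= threshold for score in dl_scores])
    for name, ilp_pred in ilp_runs.items():
        score = f1_score(y_true, gated_ilp_preds(ilp_pred, dl_parent_scores, threshold))
        if score > best_f1:
            best_name, best_f1 = name, score
    return best_name, best_f1


if __name__ == "__main__":
    y_true = [True, True, False, False]
    dl_scores = [0.9, 0.2, 0.7, 0.1]                 # DL score for the leaf class
    dl_parent_scores = [[0.9], [0.8], [0.9], [0.3]]  # DL scores of the leaf's parents
    ilp_runs = {
        "run_A": [True, True, False, True],
        "run_B": [True, False, False, False],
    }
    print(select_model(y_true, dl_scores, dl_parent_scores, ilp_runs))
```

In this toy data, the gate removes the one false positive of `run_A` (its parent score is 0.3), which is why the gated ILP run ends up ahead of the DL baseline.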
+
+**Step 3 — Aggregate into final predictions:**
+```bash
+python -m chebILP ensemble_aggregate \
+--dl_preds_npy data/preds/test_preds.npy \
+--dl_preds_meta data/preds/test_preds_metadata.json \
+--ilp_preds_npy data/ensemble_predictions/ensemble_ilp_preds.npy \
+--ilp_preds_meta data/ensemble_predictions/ensemble_ilp_preds_metadata.json \
+--trusted_models data/ensemble_predictions/ensemble_trusted_models.csv \
+--label_stats data/chebi_v248/ChEBI25_3_STAR/processed/class_stats.csv \
+--output data/ensemble_predictions/final_predictions.npy
+```
+
+DL predictions propagate freely through the class hierarchy; ILP and always-positive classes only predict a class if all label-set parents are already predicted positive. Output is a boolean NumPy array with a matching `_metadata.json`.
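One possible reading of this gating rule, for a single molecule, is sketched below. It assumes that classes trusted to DL take their thresholded DL score directly, while ILP and always-positive classes are additionally gated on their parents, with classes visited parents-first. The function name, the dictionary layout, and the model labels `"dl"`, `"ilp"` and `"always_positive"` are hypothetical; the actual `ensemble_aggregate` command works on NumPy tensors and may differ in detail.

```python
from graphlib import TopologicalSorter


def aggregate(parents, trusted, dl_scores, ilp_preds, threshold=0.5):
    """Combine per-class predictions for one molecule.

    parents:   class -> list of parent classes inside the label set
    trusted:   class -> "dl", "ilp" or "always_positive" (hypothetical labels)
    dl_scores: class -> DL score, needed for DL-trusted classes
    ilp_preds: class -> bool, needed for ILP-trusted classes
    """
    final = {}
    # static_order() yields predecessors (here: parents) before their children.
    for cls in TopologicalSorter(parents).static_order():
        if trusted[cls] == "dl":
            final[cls] = dl_scores[cls] >= threshold  # not gated on parents
            continue
        own = True if trusted[cls] == "always_positive" else ilp_preds[cls]
        final[cls] = own and all(final[p] for p in parents.get(cls, []))
    return final


if __name__ == "__main__":
    parents = {"A": [], "B": ["A"], "C": ["B"], "D": ["A"]}
    trusted = {"A": "dl", "B": "dl", "C": "ilp", "D": "ilp"}
    result = aggregate(
        parents,
        trusted,
        dl_scores={"A": 0.9, "B": 0.3},
        ilp_preds={"C": True, "D": True},
    )
    print(result)
```

Here class `C` is rejected even though its ILP rule fires, because its parent `B` was not predicted positive, whereas `D` is accepted since its parent `A` is positive.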
+
+---
+
+## Other utilities
+
+**Translate a rule to natural language:**
+```bash
+python -m chebILP rule_to_nl --rule_file my_rule.pl --class_parents data/class_parents.json
+```
+
+**Explain why a molecule satisfies a rule:**
+```bash
+python -m chebILP explain \
+--smiles "CCO" \
+--rule_file my_rule.pl \
+--label_parents_json data/class_parents.json \
+--output explanation.png
+```
File without changes
|