debase 0.1.3.tar.gz → 0.1.5.tar.gz
This diff compares publicly available package versions as released to a supported public registry. It is provided for informational purposes only and reflects the packages exactly as they appear in that registry.
- {debase-0.1.3 → debase-0.1.5}/PKG-INFO +57 -1
- {debase-0.1.3 → debase-0.1.5}/README.md +56 -0
- {debase-0.1.3 → debase-0.1.5}/src/debase/_version.py +1 -1
- {debase-0.1.3 → debase-0.1.5}/src/debase/enzyme_lineage_extractor.py +85 -9
- {debase-0.1.3 → debase-0.1.5}/src/debase.egg-info/PKG-INFO +57 -1
- {debase-0.1.3 → debase-0.1.5}/.gitignore +0 -0
- {debase-0.1.3 → debase-0.1.5}/CONTRIBUTING.md +0 -0
- {debase-0.1.3 → debase-0.1.5}/LICENSE +0 -0
- {debase-0.1.3 → debase-0.1.5}/MANIFEST.in +0 -0
- {debase-0.1.3 → debase-0.1.5}/docs/README.md +0 -0
- {debase-0.1.3 → debase-0.1.5}/docs/examples/README.md +0 -0
- {debase-0.1.3 → debase-0.1.5}/environment.yml +0 -0
- {debase-0.1.3 → debase-0.1.5}/pyproject.toml +0 -0
- {debase-0.1.3 → debase-0.1.5}/setup.cfg +0 -0
- {debase-0.1.3 → debase-0.1.5}/setup.py +0 -0
- {debase-0.1.3 → debase-0.1.5}/src/__init__.py +0 -0
- {debase-0.1.3 → debase-0.1.5}/src/debase/PIPELINE_FLOW.md +0 -0
- {debase-0.1.3 → debase-0.1.5}/src/debase/__init__.py +0 -0
- {debase-0.1.3 → debase-0.1.5}/src/debase/__main__.py +0 -0
- {debase-0.1.3 → debase-0.1.5}/src/debase/build_db.py +0 -0
- {debase-0.1.3 → debase-0.1.5}/src/debase/cleanup_sequence.py +0 -0
- {debase-0.1.3 → debase-0.1.5}/src/debase/lineage_format.py +0 -0
- {debase-0.1.3 → debase-0.1.5}/src/debase/reaction_info_extractor.py +0 -0
- {debase-0.1.3 → debase-0.1.5}/src/debase/substrate_scope_extractor.py +0 -0
- {debase-0.1.3 → debase-0.1.5}/src/debase/wrapper.py +0 -0
- {debase-0.1.3 → debase-0.1.5}/src/debase.egg-info/SOURCES.txt +0 -0
- {debase-0.1.3 → debase-0.1.5}/src/debase.egg-info/dependency_links.txt +0 -0
- {debase-0.1.3 → debase-0.1.5}/src/debase.egg-info/entry_points.txt +0 -0
- {debase-0.1.3 → debase-0.1.5}/src/debase.egg-info/requires.txt +0 -0
- {debase-0.1.3 → debase-0.1.5}/src/debase.egg-info/top_level.txt +0 -0
{debase-0.1.3 → debase-0.1.5}/PKG-INFO (+57 -1)

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: debase
-Version: 0.1.3
+Version: 0.1.5
 Summary: Enzyme lineage analysis and sequence extraction package
 Home-page: https://github.com/YuemingLong/DEBase
 Author: DEBase Team
@@ -61,14 +61,70 @@ Enzyme lineage analysis and sequence extraction package with advanced parallel p

 ## Installation

+### Quick Install (PyPI)
 ```bash
 pip install debase
 ```
+
+### Development Setup with Conda (Recommended)
+
+1. **Clone the repository**
+```bash
+git clone https://github.com/YuemingLong/DEBase.git
+cd DEBase
+```
+
+2. **Create conda environment from provided file**
+```bash
+conda env create -f environment.yml
+conda activate debase
+```
+
+3. **Install DEBase in development mode**
+```bash
+pip install -e .
+```
+
+### Manual Setup
+
+If you prefer to set up the environment manually:
+
+```bash
+# Create new conda environment
+conda create -n debase python=3.9
+conda activate debase
+
+# Install conda packages
+conda install -c conda-forge pandas numpy matplotlib seaborn jupyter jupyterlab openpyxl biopython requests tqdm
+
+# Install RDKit (optional - used for SMILES canonicalization)
+conda install -c conda-forge rdkit
+
+# Install pip-only packages
+pip install PyMuPDF google-generativeai debase
+```
+
+**Note about RDKit**: RDKit is optional and only used for canonicalizing SMILES strings in the output. If not installed, DEBase will still function normally but SMILES strings won't be standardized.
+
 ## Requirements

 - Python 3.8 or higher
 - A Gemini API key (set as environment variable `GEMINI_API_KEY`)

+### Setting up Gemini API Key
+
+```bash
+# Option 1: Export in your shell
+export GEMINI_API_KEY="your-api-key-here"
+
+# Option 2: Add to ~/.bashrc or ~/.zshrc for persistence
+echo 'export GEMINI_API_KEY="your-api-key-here"' >> ~/.bashrc
+source ~/.bashrc
+
+# Option 3: Create .env file in project directory
+echo 'GEMINI_API_KEY=your-api-key-here' > .env
+```
+
 ## Recent Updates

 - **Campaign-Aware Extraction**: Automatically detects and processes multiple directed evolution campaigns in a single paper
{debase-0.1.3 → debase-0.1.5}/README.md (+56 -0)

@@ -4,14 +4,70 @@ Enzyme lineage analysis and sequence extraction package with advanced parallel p

 ## Installation

+### Quick Install (PyPI)
 ```bash
 pip install debase
 ```
+
+### Development Setup with Conda (Recommended)
+
+1. **Clone the repository**
+```bash
+git clone https://github.com/YuemingLong/DEBase.git
+cd DEBase
+```
+
+2. **Create conda environment from provided file**
+```bash
+conda env create -f environment.yml
+conda activate debase
+```
+
+3. **Install DEBase in development mode**
+```bash
+pip install -e .
+```
+
+### Manual Setup
+
+If you prefer to set up the environment manually:
+
+```bash
+# Create new conda environment
+conda create -n debase python=3.9
+conda activate debase
+
+# Install conda packages
+conda install -c conda-forge pandas numpy matplotlib seaborn jupyter jupyterlab openpyxl biopython requests tqdm
+
+# Install RDKit (optional - used for SMILES canonicalization)
+conda install -c conda-forge rdkit
+
+# Install pip-only packages
+pip install PyMuPDF google-generativeai debase
+```
+
+**Note about RDKit**: RDKit is optional and only used for canonicalizing SMILES strings in the output. If not installed, DEBase will still function normally but SMILES strings won't be standardized.
+
 ## Requirements

 - Python 3.8 or higher
 - A Gemini API key (set as environment variable `GEMINI_API_KEY`)

+### Setting up Gemini API Key
+
+```bash
+# Option 1: Export in your shell
+export GEMINI_API_KEY="your-api-key-here"
+
+# Option 2: Add to ~/.bashrc or ~/.zshrc for persistence
+echo 'export GEMINI_API_KEY="your-api-key-here"' >> ~/.bashrc
+source ~/.bashrc
+
+# Option 3: Create .env file in project directory
+echo 'GEMINI_API_KEY=your-api-key-here' > .env
+```
+
 ## Recent Updates

 - **Campaign-Aware Extraction**: Automatically detects and processes multiple directed evolution campaigns in a single paper
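The new installation and API-key sections above are shell-oriented. As a quick cross-check that an environment set up this way is usable, a minimal Python sketch follows; the `check_setup.py` name and the assumption that the installed package exposes a `__version__` attribute (it ships `src/debase/_version.py`) are illustrative, not documented API:

```python
# check_setup.py -- illustrative sanity check; assumes the installed package
# exposes __version__ and that the key is supplied via the GEMINI_API_KEY
# environment variable described in the README additions above.
import importlib
import os
import sys


def main() -> int:
    if not os.environ.get("GEMINI_API_KEY"):
        print("GEMINI_API_KEY is not set; export it or add it to a .env file.")
        return 1
    try:
        debase = importlib.import_module("debase")
    except ImportError:
        print("debase is not importable; run `pip install debase` or `pip install -e .`")
        return 1
    print(f"debase {getattr(debase, '__version__', 'unknown')} is importable and the API key is set.")
    return 0


if __name__ == "__main__":
    sys.exit(main())
```

Run it with `python check_setup.py` after activating the conda environment.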
{debase-0.1.3 → debase-0.1.5}/src/debase/enzyme_lineage_extractor.py (+85 -9)

@@ -800,15 +800,36 @@ def identify_evolution_locations(
             _dump(f"=== CAMPAIGN MAPPING PROMPT ===\nLocation: {location_str}\n{'='*80}\n\n{mapping_prompt}", mapping_file)

         response = model.generate_content(mapping_prompt)
-        campaign_id = _extract_text(response).strip()
+        response_text = _extract_text(response).strip()
+
+        # Extract just the campaign_id from the response
+        # Look for the campaign_id pattern in the response
+        campaign_id = None
+        for campaign in campaigns:
+            if hasattr(campaign, 'campaign_id') and campaign.campaign_id in response_text:
+                campaign_id = campaign.campaign_id
+                break
+
+        # If not found, try to extract the last line or quoted string
+        if not campaign_id:
+            # Try to find quoted string
+            quoted_match = re.search(r'"([^"]+)"', response_text)
+            if quoted_match:
+                campaign_id = quoted_match.group(1)
+            else:
+                # Take the last non-empty line
+                lines = [line.strip() for line in response_text.split('\n') if line.strip()]
+                if lines:
+                    campaign_id = lines[-1].strip('"')

         # Save mapping response to debug if provided
         if debug_dir:
            response_file = debug_path / f"campaign_mapping_response_{location_str.replace(' ', '_')}_{int(time.time())}.txt"
-            _dump(f"=== CAMPAIGN MAPPING RESPONSE ===\nLocation: {location_str}\
+            _dump(f"=== CAMPAIGN MAPPING RESPONSE ===\nLocation: {location_str}\nFull response:\n{response_text}\nExtracted campaign_id: {campaign_id}\n{'='*80}", response_file)

         # Add campaign_id to location
-        loc['campaign_id'] = campaign_id
+        if campaign_id:
+            loc['campaign_id'] = campaign_id
         log.info(f"Mapped {location_str} to campaign: {campaign_id}")
     except Exception as exc:
         log.warning(f"Failed to map location to campaign: {exc}")
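The hunk above replaces a direct `campaign_id = _extract_text(response).strip()` assignment with a three-step fallback: match a known campaign_id anywhere in the reply, then a double-quoted string, then the last non-empty line. A self-contained sketch of that parsing order, using made-up campaign IDs and responses (the `parse_campaign_id` helper is hypothetical; only the matching order mirrors the diff):

```python
# Standalone sketch of the campaign-id fallback parsing added in this hunk.
# KNOWN_CAMPAIGNS and the sample responses are invented for illustration;
# the real code matches against `campaign.campaign_id` attributes.
import re
from typing import Optional, Sequence


def parse_campaign_id(response_text: str, known_ids: Sequence[str]) -> Optional[str]:
    # 1) Prefer an exact known campaign_id appearing anywhere in the reply.
    for cid in known_ids:
        if cid in response_text:
            return cid
    # 2) Otherwise take the first double-quoted string.
    quoted = re.search(r'"([^"]+)"', response_text)
    if quoted:
        return quoted.group(1)
    # 3) Otherwise fall back to the last non-empty line, stripped of quotes.
    lines = [line.strip() for line in response_text.splitlines() if line.strip()]
    return lines[-1].strip('"') if lines else None


KNOWN_CAMPAIGNS = ["campaign_A", "campaign_B"]  # hypothetical IDs
print(parse_campaign_id('The best match is "campaign_B".', KNOWN_CAMPAIGNS))  # campaign_B
print(parse_campaign_id("some explanation\ncampaign_C", []))                  # campaign_C
```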
@@ -1003,6 +1024,38 @@ IMPORTANT: Only extract variants that belong to this specific campaign.

 # ---- 6.3 Helper for location-based extraction -----------------------------

+def _is_toc_entry(text: str, position: int, pattern: str) -> bool:
+    """Check if a found pattern is likely a table of contents entry."""
+    # Find the line containing this position
+    line_start = text.rfind('\n', 0, position)
+    line_end = text.find('\n', position)
+
+    if line_start == -1:
+        line_start = 0
+    else:
+        line_start += 1
+
+    if line_end == -1:
+        line_end = len(text)
+
+    line = text[line_start:line_end]
+
+    # TOC indicators:
+    # 1. Line contains dots (...) followed by page number
+    # 2. Line ends with just a page number
+    # 3. Line has "Table S12:" or similar followed by title and page
+    if '...' in line or re.search(r'\.\s*\d+\s*$', line) or re.search(r':\s*[^:]+\s+\d+\s*$', line):
+        return True
+
+    # Check if this is in a contents/TOC section
+    # Look backwards up to 500 chars for "Contents" or "Table of Contents"
+    context_start = max(0, position - 500)
+    context = text[context_start:position].lower()
+    if 'contents' in context or 'table of contents' in context:
+        return True
+
+    return False
+
 def _extract_text_at_locations(text: str, locations: List[Union[str, dict]], context_chars: int = 5000, validate_sequences: bool = False) -> str:
     """Extract text around identified locations."""
     if not locations:
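To see what the new `_is_toc_entry` helper actually filters, here is a condensed, self-contained run of the same checks on two invented snippets (the sample strings are not from any real paper; the logic is condensed from the hunk above):

```python
# Quick demonstration of the TOC heuristic introduced above, on made-up text.
import re


def _is_toc_entry(text: str, position: int, pattern: str) -> bool:
    """Check if a found pattern is likely a table of contents entry."""
    line_start = text.rfind('\n', 0, position)
    line_end = text.find('\n', position)
    line_start = 0 if line_start == -1 else line_start + 1
    line_end = len(text) if line_end == -1 else line_end
    line = text[line_start:line_end]

    # Dotted leaders, a trailing page number, or "Table S12: title  page" all
    # suggest a contents entry rather than the table itself.
    if '...' in line or re.search(r'\.\s*\d+\s*$', line) or re.search(r':\s*[^:]+\s+\d+\s*$', line):
        return True

    # A "Contents" heading within the previous 500 characters is also a hint.
    context = text[max(0, position - 500):position].lower()
    return 'contents' in context or 'table of contents' in context


toc_text = "Table of Contents\nTable S12: Evolution lineage ........ 14\n"
body_text = "As shown in Table S12, the lineage begins with the parent enzyme.\n"
print(_is_toc_entry(toc_text, toc_text.index("Table S12"), "Table S12"))    # True
print(_is_toc_entry(body_text, body_text.index("Table S12"), "Table S12"))  # False
```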
@@ -1061,11 +1114,25 @@ def _extract_text_at_locations(text: str, locations: List[Union[str, dict]], con
         pos = -1
         used_pattern = None
         for pattern in page_patterns:
-            temp_pos = text_lower.find(pattern.lower())
-            if temp_pos != -1:
+            search_pos = 0
+            while search_pos < len(text_lower):
+                temp_pos = text_lower.find(pattern.lower(), search_pos)
+                if temp_pos == -1:
+                    break
+
+                # Check if this is a TOC entry
+                if _is_toc_entry(text, temp_pos, pattern):
+                    log.debug("Skipping TOC entry for pattern '%s' at position %d", pattern, temp_pos)
+                    search_pos = temp_pos + len(pattern)
+                    continue
+
+                # Found non-TOC entry
                 pos = temp_pos
                 used_pattern = pattern
-                log.debug("Found pattern '%s' at position %d", pattern, pos)
+                log.debug("Found pattern '%s' at position %d (not TOC)", pattern, pos)
+                break
+
+            if pos != -1:
                 break

         if pos != -1:
@@ -1254,7 +1321,9 @@ def get_lineage(

         # Use text-based extraction (works for tables and text sections)
         # Extract from full text, not caption text - use only primary location
-        focused_text = _extract_text_at_locations(full_text, [primary_location])
+        # Use more context for tables since they often span multiple pages
+        context_size = 15000 if location_type == 'table' else 5000
+        focused_text = _extract_text_at_locations(full_text, [primary_location], context_chars=context_size)
         log.info("Reduced text from %d to %d chars using primary location %s for campaign %s",
                  len(full_text), len(focused_text),
                  primary_location.get('location', 'Unknown') if isinstance(primary_location, dict) else 'Unknown',
@@ -2038,8 +2107,15 @@ def run_pipeline(
     sequences = get_sequences(full_text, model, pdf_paths=pdf_paths, debug_dir=debug_dir)

     # 4a. Try PDB extraction if no sequences found -----------------------------
-    if not sequences:
-
+    # Check if we need PDB sequences (no sequences or only partial sequences)
+    MIN_PROTEIN_LENGTH = 50  # Most proteins are >50 AA
+    needs_pdb = (not sequences or
+                 all(s.aa_seq is None or (s.aa_seq and len(s.aa_seq) < MIN_PROTEIN_LENGTH)
+                     for s in sequences))
+
+    if needs_pdb:
+        log.info("No full-length sequences found in paper (only partial sequences < %d AA), attempting PDB extraction...",
+                 MIN_PROTEIN_LENGTH)

     # Extract PDB IDs from all PDFs
     pdb_ids = []
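The `needs_pdb` condition added above is easiest to read with small inputs. A minimal sketch, assuming a made-up `SeqRecord` stand-in for the pipeline's sequence records (the real objects only need an `aa_seq` attribute, as in the hunk):

```python
# Sketch of the new "do we still need PDB sequences?" decision, using a tiny
# stand-in record type; only the aa_seq attribute matters here.
from dataclasses import dataclass
from typing import List, Optional

MIN_PROTEIN_LENGTH = 50  # Most proteins are >50 AA, as in the diff


@dataclass
class SeqRecord:  # hypothetical record type for illustration
    variant_id: str
    aa_seq: Optional[str]


def needs_pdb(sequences: List[SeqRecord]) -> bool:
    # Mirror of the condition in the hunk: fall back to PDB when nothing was
    # extracted, or when every extracted sequence is missing or shorter than
    # a plausible full-length protein.
    return bool(not sequences or
                all(s.aa_seq is None or (s.aa_seq and len(s.aa_seq) < MIN_PROTEIN_LENGTH)
                    for s in sequences))


print(needs_pdb([]))                                     # True  - nothing extracted
print(needs_pdb([SeqRecord("P1-A2", "MKV" * 10)]))       # True  - only a 30-residue fragment
print(needs_pdb([SeqRecord("P1-A2", "M" + "A" * 299)]))  # False - full-length sequence
```

A paper that yields only short mutation-bearing fragments therefore still triggers the PDB lookup.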
{debase-0.1.3 → debase-0.1.5}/src/debase.egg-info/PKG-INFO (+57 -1)

Identical to the PKG-INFO diff shown above: the Version field changes from 0.1.3 to 0.1.5, and the same Installation and Gemini API key sections are added to the embedded long description.