debase 0.1.3.tar.gz → 0.1.5.tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (30)
  1. {debase-0.1.3 → debase-0.1.5}/PKG-INFO +57 -1
  2. {debase-0.1.3 → debase-0.1.5}/README.md +56 -0
  3. {debase-0.1.3 → debase-0.1.5}/src/debase/_version.py +1 -1
  4. {debase-0.1.3 → debase-0.1.5}/src/debase/enzyme_lineage_extractor.py +85 -9
  5. {debase-0.1.3 → debase-0.1.5}/src/debase.egg-info/PKG-INFO +57 -1
  6. {debase-0.1.3 → debase-0.1.5}/.gitignore +0 -0
  7. {debase-0.1.3 → debase-0.1.5}/CONTRIBUTING.md +0 -0
  8. {debase-0.1.3 → debase-0.1.5}/LICENSE +0 -0
  9. {debase-0.1.3 → debase-0.1.5}/MANIFEST.in +0 -0
  10. {debase-0.1.3 → debase-0.1.5}/docs/README.md +0 -0
  11. {debase-0.1.3 → debase-0.1.5}/docs/examples/README.md +0 -0
  12. {debase-0.1.3 → debase-0.1.5}/environment.yml +0 -0
  13. {debase-0.1.3 → debase-0.1.5}/pyproject.toml +0 -0
  14. {debase-0.1.3 → debase-0.1.5}/setup.cfg +0 -0
  15. {debase-0.1.3 → debase-0.1.5}/setup.py +0 -0
  16. {debase-0.1.3 → debase-0.1.5}/src/__init__.py +0 -0
  17. {debase-0.1.3 → debase-0.1.5}/src/debase/PIPELINE_FLOW.md +0 -0
  18. {debase-0.1.3 → debase-0.1.5}/src/debase/__init__.py +0 -0
  19. {debase-0.1.3 → debase-0.1.5}/src/debase/__main__.py +0 -0
  20. {debase-0.1.3 → debase-0.1.5}/src/debase/build_db.py +0 -0
  21. {debase-0.1.3 → debase-0.1.5}/src/debase/cleanup_sequence.py +0 -0
  22. {debase-0.1.3 → debase-0.1.5}/src/debase/lineage_format.py +0 -0
  23. {debase-0.1.3 → debase-0.1.5}/src/debase/reaction_info_extractor.py +0 -0
  24. {debase-0.1.3 → debase-0.1.5}/src/debase/substrate_scope_extractor.py +0 -0
  25. {debase-0.1.3 → debase-0.1.5}/src/debase/wrapper.py +0 -0
  26. {debase-0.1.3 → debase-0.1.5}/src/debase.egg-info/SOURCES.txt +0 -0
  27. {debase-0.1.3 → debase-0.1.5}/src/debase.egg-info/dependency_links.txt +0 -0
  28. {debase-0.1.3 → debase-0.1.5}/src/debase.egg-info/entry_points.txt +0 -0
  29. {debase-0.1.3 → debase-0.1.5}/src/debase.egg-info/requires.txt +0 -0
  30. {debase-0.1.3 → debase-0.1.5}/src/debase.egg-info/top_level.txt +0 -0
{debase-0.1.3 → debase-0.1.5}/PKG-INFO
@@ -1,6 +1,6 @@
  Metadata-Version: 2.4
  Name: debase
- Version: 0.1.3
+ Version: 0.1.5
  Summary: Enzyme lineage analysis and sequence extraction package
  Home-page: https://github.com/YuemingLong/DEBase
  Author: DEBase Team
@@ -61,14 +61,70 @@ Enzyme lineage analysis and sequence extraction package with advanced parallel p

  ## Installation

+ ### Quick Install (PyPI)
  ```bash
  pip install debase
  ```
+
+ ### Development Setup with Conda (Recommended)
+
+ 1. **Clone the repository**
+ ```bash
+ git clone https://github.com/YuemingLong/DEBase.git
+ cd DEBase
+ ```
+
+ 2. **Create conda environment from provided file**
+ ```bash
+ conda env create -f environment.yml
+ conda activate debase
+ ```
+
+ 3. **Install DEBase in development mode**
+ ```bash
+ pip install -e .
+ ```
+
+ ### Manual Setup
+
+ If you prefer to set up the environment manually:
+
+ ```bash
+ # Create new conda environment
+ conda create -n debase python=3.9
+ conda activate debase
+
+ # Install conda packages
+ conda install -c conda-forge pandas numpy matplotlib seaborn jupyter jupyterlab openpyxl biopython requests tqdm
+
+ # Install RDKit (optional - used for SMILES canonicalization)
+ conda install -c conda-forge rdkit
+
+ # Install pip-only packages
+ pip install PyMuPDF google-generativeai debase
+ ```
+
+ **Note about RDKit**: RDKit is optional and only used for canonicalizing SMILES strings in the output. If not installed, DEBase will still function normally but SMILES strings won't be standardized.
+
  ## Requirements

  - Python 3.8 or higher
  - A Gemini API key (set as environment variable `GEMINI_API_KEY`)

+ ### Setting up Gemini API Key
+
+ ```bash
+ # Option 1: Export in your shell
+ export GEMINI_API_KEY="your-api-key-here"
+
+ # Option 2: Add to ~/.bashrc or ~/.zshrc for persistence
+ echo 'export GEMINI_API_KEY="your-api-key-here"' >> ~/.bashrc
+ source ~/.bashrc
+
+ # Option 3: Create .env file in project directory
+ echo 'GEMINI_API_KEY=your-api-key-here' > .env
+ ```
+
  ## Recent Updates

  - **Campaign-Aware Extraction**: Automatically detects and processes multiple directed evolution campaigns in a single paper
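The RDKit note added above describes graceful degradation rather than a hard dependency. A minimal sketch of that pattern, assuming a hypothetical `canonicalize_smiles` helper (the name and call site are illustrative, not DEBase's actual code):

```python
# Illustrative sketch only: degrade gracefully when RDKit is absent.
# `canonicalize_smiles` is a hypothetical helper, not DEBase's actual API.
try:
    from rdkit import Chem
    HAS_RDKIT = True
except ImportError:
    HAS_RDKIT = False


def canonicalize_smiles(smiles: str) -> str:
    """Return canonical SMILES when RDKit is available, else the input unchanged."""
    if not HAS_RDKIT:
        return smiles
    mol = Chem.MolFromSmiles(smiles)
    # Unparseable strings are passed through rather than dropped.
    return Chem.MolToSmiles(mol) if mol is not None else smiles


print(canonicalize_smiles("C1=CC=CC=C1"))  # "c1ccccc1" with RDKit, the input string without
```

Doing the import check once at module level keeps the availability test out of the per-SMILES path.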
{debase-0.1.3 → debase-0.1.5}/README.md
@@ -4,14 +4,70 @@ Enzyme lineage analysis and sequence extraction package with advanced parallel p

  ## Installation

+ ### Quick Install (PyPI)
  ```bash
  pip install debase
  ```
+
+ ### Development Setup with Conda (Recommended)
+
+ 1. **Clone the repository**
+ ```bash
+ git clone https://github.com/YuemingLong/DEBase.git
+ cd DEBase
+ ```
+
+ 2. **Create conda environment from provided file**
+ ```bash
+ conda env create -f environment.yml
+ conda activate debase
+ ```
+
+ 3. **Install DEBase in development mode**
+ ```bash
+ pip install -e .
+ ```
+
+ ### Manual Setup
+
+ If you prefer to set up the environment manually:
+
+ ```bash
+ # Create new conda environment
+ conda create -n debase python=3.9
+ conda activate debase
+
+ # Install conda packages
+ conda install -c conda-forge pandas numpy matplotlib seaborn jupyter jupyterlab openpyxl biopython requests tqdm
+
+ # Install RDKit (optional - used for SMILES canonicalization)
+ conda install -c conda-forge rdkit
+
+ # Install pip-only packages
+ pip install PyMuPDF google-generativeai debase
+ ```
+
+ **Note about RDKit**: RDKit is optional and only used for canonicalizing SMILES strings in the output. If not installed, DEBase will still function normally but SMILES strings won't be standardized.
+
  ## Requirements

  - Python 3.8 or higher
  - A Gemini API key (set as environment variable `GEMINI_API_KEY`)

+ ### Setting up Gemini API Key
+
+ ```bash
+ # Option 1: Export in your shell
+ export GEMINI_API_KEY="your-api-key-here"
+
+ # Option 2: Add to ~/.bashrc or ~/.zshrc for persistence
+ echo 'export GEMINI_API_KEY="your-api-key-here"' >> ~/.bashrc
+ source ~/.bashrc
+
+ # Option 3: Create .env file in project directory
+ echo 'GEMINI_API_KEY=your-api-key-here' > .env
+ ```
+
  ## Recent Updates

  - **Campaign-Aware Extraction**: Automatically detects and processes multiple directed evolution campaigns in a single paper
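All three key-setup options above end with the key exposed as the `GEMINI_API_KEY` environment variable. A minimal sketch of reading it at runtime, assuming only that convention; the python-dotenv fallback mirrors Option 3 and is optional:

```python
import os

# Optional .env support (mirrors Option 3 above); skipped if python-dotenv is absent.
try:
    from dotenv import load_dotenv
    load_dotenv()
except ImportError:
    pass

api_key = os.environ.get("GEMINI_API_KEY")
if not api_key:
    raise SystemExit("GEMINI_API_KEY is not set; see the setup options above.")
print("Gemini API key found (%d characters)." % len(api_key))
```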
{debase-0.1.3 → debase-0.1.5}/src/debase/_version.py
@@ -1,3 +1,3 @@
  """Version information."""

- __version__ = "0.1.3"
+ __version__ = "0.1.5"
{debase-0.1.3 → debase-0.1.5}/src/debase/enzyme_lineage_extractor.py
@@ -800,15 +800,36 @@ def identify_evolution_locations(
      _dump(f"=== CAMPAIGN MAPPING PROMPT ===\nLocation: {location_str}\n{'='*80}\n\n{mapping_prompt}", mapping_file)

  response = model.generate_content(mapping_prompt)
- campaign_id = _extract_text(response).strip().strip('"')
+ response_text = _extract_text(response).strip()
+
+ # Extract just the campaign_id from the response
+ # Look for the campaign_id pattern in the response
+ campaign_id = None
+ for campaign in campaigns:
+     if hasattr(campaign, 'campaign_id') and campaign.campaign_id in response_text:
+         campaign_id = campaign.campaign_id
+         break
+
+ # If not found, try to extract the last line or quoted string
+ if not campaign_id:
+     # Try to find quoted string
+     quoted_match = re.search(r'"([^"]+)"', response_text)
+     if quoted_match:
+         campaign_id = quoted_match.group(1)
+     else:
+         # Take the last non-empty line
+         lines = [line.strip() for line in response_text.split('\n') if line.strip()]
+         if lines:
+             campaign_id = lines[-1].strip('"')

  # Save mapping response to debug if provided
  if debug_dir:
      response_file = debug_path / f"campaign_mapping_response_{location_str.replace(' ', '_')}_{int(time.time())}.txt"
-     _dump(f"=== CAMPAIGN MAPPING RESPONSE ===\nLocation: {location_str}\nMapped to: {campaign_id}\n{'='*80}\n\n{_extract_text(response)}", response_file)
+     _dump(f"=== CAMPAIGN MAPPING RESPONSE ===\nLocation: {location_str}\nFull response:\n{response_text}\nExtracted campaign_id: {campaign_id}\n{'='*80}", response_file)

  # Add campaign_id to location
- loc['campaign_id'] = campaign_id
+ if campaign_id:
+     loc['campaign_id'] = campaign_id
  log.info(f"Mapped {location_str} to campaign: {campaign_id}")
  except Exception as exc:
      log.warning(f"Failed to map location to campaign: {exc}")
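The fallback chain added above (known campaign ID as a substring, then a quoted string, then the last non-empty line) can be exercised in isolation. A minimal sketch with a hypothetical `pick_campaign_id` helper and made-up campaign IDs:

```python
import re
from typing import List, Optional


def pick_campaign_id(response_text: str, known_ids: List[str]) -> Optional[str]:
    """Fallback order as above: known ID substring, first quoted string, last line."""
    for cid in known_ids:
        if cid in response_text:
            return cid
    quoted = re.search(r'"([^"]+)"', response_text)
    if quoted:
        return quoted.group(1)
    lines = [ln.strip() for ln in response_text.split("\n") if ln.strip()]
    return lines[-1].strip('"') if lines else None


# A verbose model reply still resolves to one of the known campaign IDs.
reply = "The lineage in Table S2 belongs to the epoxidation campaign.\ncampaign_epoxidation"
print(pick_campaign_id(reply, ["campaign_epoxidation", "campaign_amination"]))
```

Checking the known IDs first keeps the mapping deterministic even when the model wraps its answer in explanation.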
@@ -1003,6 +1024,38 @@ IMPORTANT: Only extract variants that belong to this specific campaign.

  # ---- 6.3 Helper for location-based extraction -----------------------------

+ def _is_toc_entry(text: str, position: int, pattern: str) -> bool:
+     """Check if a found pattern is likely a table of contents entry."""
+     # Find the line containing this position
+     line_start = text.rfind('\n', 0, position)
+     line_end = text.find('\n', position)
+
+     if line_start == -1:
+         line_start = 0
+     else:
+         line_start += 1
+
+     if line_end == -1:
+         line_end = len(text)
+
+     line = text[line_start:line_end]
+
+     # TOC indicators:
+     # 1. Line contains dots (...) followed by page number
+     # 2. Line ends with just a page number
+     # 3. Line has "Table S12:" or similar followed by title and page
+     if '...' in line or re.search(r'\.\s*\d+\s*$', line) or re.search(r':\s*[^:]+\s+\d+\s*$', line):
+         return True
+
+     # Check if this is in a contents/TOC section
+     # Look backwards up to 500 chars for "Contents" or "Table of Contents"
+     context_start = max(0, position - 500)
+     context = text[context_start:position].lower()
+     if 'contents' in context or 'table of contents' in context:
+         return True
+
+     return False
+
  def _extract_text_at_locations(text: str, locations: List[Union[str, dict]], context_chars: int = 5000, validate_sequences: bool = False) -> str:
      """Extract text around identified locations."""
      if not locations:
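The line-level heuristics in `_is_toc_entry` are easy to sanity-check on their own. A small sketch using a standalone copy of the three substring/regex tests (the sample strings are invented):

```python
import re


def looks_like_toc_line(line: str) -> bool:
    """Standalone copy of the line-level tests in _is_toc_entry above."""
    return bool(
        "..." in line                                  # dotted leader
        or re.search(r"\.\s*\d+\s*$", line)            # trailing page number after a period
        or re.search(r":\s*[^:]+\s+\d+\s*$", line)     # "Table S12: title  47" shape
    )


print(looks_like_toc_line("Table S12: Lineage of evolved variants ........ 47"))  # True
print(looks_like_toc_line("Table S12: Amino acid substitutions per variant"))     # False
```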
@@ -1061,11 +1114,25 @@ def _extract_text_at_locations(text: str, locations: List[Union[str, dict]], con
  pos = -1
  used_pattern = None
  for pattern in page_patterns:
-     temp_pos = text_lower.find(pattern.lower())
-     if temp_pos != -1:
+     search_pos = 0
+     while search_pos < len(text_lower):
+         temp_pos = text_lower.find(pattern.lower(), search_pos)
+         if temp_pos == -1:
+             break
+
+         # Check if this is a TOC entry
+         if _is_toc_entry(text, temp_pos, pattern):
+             log.debug("Skipping TOC entry for pattern '%s' at position %d", pattern, temp_pos)
+             search_pos = temp_pos + len(pattern)
+             continue
+
+         # Found non-TOC entry
          pos = temp_pos
          used_pattern = pattern
-         log.debug("Found pattern '%s' at position %d", pattern, pos)
+         log.debug("Found pattern '%s' at position %d (not TOC)", pattern, pos)
+         break
+
+     if pos != -1:
          break

  if pos != -1:
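The reworked loop above is a resumable `str.find` scan: each TOC hit advances `search_pos` past the match so the next iteration looks further into the document. A self-contained sketch of the same idea, with a simplified stand-in for `_is_toc_entry`:

```python
def first_non_toc_hit(text: str, pattern: str) -> int:
    """Resumable scan like the loop above; the TOC test here is a simplified stand-in."""
    def in_toc(pos: int) -> bool:
        line = text[text.rfind("\n", 0, pos) + 1:].split("\n", 1)[0]
        return "..." in line  # stand-in for _is_toc_entry

    lower = text.lower()
    needle = pattern.lower()
    search_pos = 0
    while search_pos < len(lower):
        hit = lower.find(needle, search_pos)
        if hit == -1:
            return -1
        if in_toc(hit):
            search_pos = hit + len(needle)  # skip the TOC mention, keep scanning
            continue
        return hit
    return -1


doc = "Table S12 ........ 3\nMain text follows.\nTable S12. Lineage of variants"
print(first_non_toc_hit(doc, "Table S12"))  # index of the second, non-TOC occurrence
```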
@@ -1254,7 +1321,9 @@ def get_lineage(

  # Use text-based extraction (works for tables and text sections)
  # Extract from full text, not caption text - use only primary location
- focused_text = _extract_text_at_locations(full_text, [primary_location])
+ # Use more context for tables since they often span multiple pages
+ context_size = 15000 if location_type == 'table' else 5000
+ focused_text = _extract_text_at_locations(full_text, [primary_location], context_chars=context_size)
  log.info("Reduced text from %d to %d chars using primary location %s for campaign %s",
           len(full_text), len(focused_text),
           primary_location.get('location', 'Unknown') if isinstance(primary_location, dict) else 'Unknown',
@@ -2038,8 +2107,15 @@ def run_pipeline(
  sequences = get_sequences(full_text, model, pdf_paths=pdf_paths, debug_dir=debug_dir)

  # 4a. Try PDB extraction if no sequences found -----------------------------
- if not sequences or all(s.aa_seq is None for s in sequences):
-     log.info("No sequences found in paper, attempting PDB extraction...")
+ # Check if we need PDB sequences (no sequences or only partial sequences)
+ MIN_PROTEIN_LENGTH = 50  # Most proteins are >50 AA
+ needs_pdb = (not sequences or
+              all(s.aa_seq is None or (s.aa_seq and len(s.aa_seq) < MIN_PROTEIN_LENGTH)
+                  for s in sequences))
+
+ if needs_pdb:
+     log.info("No full-length sequences found in paper (only partial sequences < %d AA), attempting PDB extraction...",
+              MIN_PROTEIN_LENGTH)

  # Extract PDB IDs from all PDFs
  pdb_ids = []
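The new trigger treats fragments shorter than `MIN_PROTEIN_LENGTH` the same as missing sequences. A small sketch of the predicate with a stand-in record type (the dataclass and variant IDs are illustrative; the pipeline's own sequence objects differ):

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SeqRecord:  # stand-in for the pipeline's own sequence records
    variant_id: str
    aa_seq: Optional[str]


MIN_PROTEIN_LENGTH = 50  # same threshold as in the hunk above


def needs_pdb(sequences: List[SeqRecord]) -> bool:
    """True when nothing usable was extracted: no records, or only missing/short chains."""
    return (not sequences or
            all(s.aa_seq is None or (s.aa_seq and len(s.aa_seq) < MIN_PROTEIN_LENGTH)
                for s in sequences))


print(needs_pdb([]))                                  # True: nothing extracted
print(needs_pdb([SeqRecord("variant-A3", "MKTV")]))   # True: only a short fragment
print(needs_pdb([SeqRecord("variant-A3", "M" * 300)]))  # False: full-length chain present
```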
{debase-0.1.3 → debase-0.1.5}/src/debase.egg-info/PKG-INFO
@@ -1,6 +1,6 @@
  Metadata-Version: 2.4
  Name: debase
- Version: 0.1.3
+ Version: 0.1.5
  Summary: Enzyme lineage analysis and sequence extraction package
  Home-page: https://github.com/YuemingLong/DEBase
  Author: DEBase Team
@@ -61,14 +61,70 @@ Enzyme lineage analysis and sequence extraction package with advanced parallel p

  ## Installation

+ ### Quick Install (PyPI)
  ```bash
  pip install debase
  ```
+
+ ### Development Setup with Conda (Recommended)
+
+ 1. **Clone the repository**
+ ```bash
+ git clone https://github.com/YuemingLong/DEBase.git
+ cd DEBase
+ ```
+
+ 2. **Create conda environment from provided file**
+ ```bash
+ conda env create -f environment.yml
+ conda activate debase
+ ```
+
+ 3. **Install DEBase in development mode**
+ ```bash
+ pip install -e .
+ ```
+
+ ### Manual Setup
+
+ If you prefer to set up the environment manually:
+
+ ```bash
+ # Create new conda environment
+ conda create -n debase python=3.9
+ conda activate debase
+
+ # Install conda packages
+ conda install -c conda-forge pandas numpy matplotlib seaborn jupyter jupyterlab openpyxl biopython requests tqdm
+
+ # Install RDKit (optional - used for SMILES canonicalization)
+ conda install -c conda-forge rdkit
+
+ # Install pip-only packages
+ pip install PyMuPDF google-generativeai debase
+ ```
+
+ **Note about RDKit**: RDKit is optional and only used for canonicalizing SMILES strings in the output. If not installed, DEBase will still function normally but SMILES strings won't be standardized.
+
  ## Requirements

  - Python 3.8 or higher
  - A Gemini API key (set as environment variable `GEMINI_API_KEY`)

+ ### Setting up Gemini API Key
+
+ ```bash
+ # Option 1: Export in your shell
+ export GEMINI_API_KEY="your-api-key-here"
+
+ # Option 2: Add to ~/.bashrc or ~/.zshrc for persistence
+ echo 'export GEMINI_API_KEY="your-api-key-here"' >> ~/.bashrc
+ source ~/.bashrc
+
+ # Option 3: Create .env file in project directory
+ echo 'GEMINI_API_KEY=your-api-key-here' > .env
+ ```
+
  ## Recent Updates

  - **Campaign-Aware Extraction**: Automatically detects and processes multiple directed evolution campaigns in a single paper