PyPI - debase - Versions diffs - 0.1.1__py3-none-any.whl → 0.1.3__py3-none-any.whl - Mend

debase 0.1.1__py3-none-any.whl → 0.1.3__py3-none-any.whl

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (11)

debase/PIPELINE_FLOW.md +100 -0
debase/_version.py +1 -1
debase/enzyme_lineage_extractor.py +10 -1
debase/reaction_info_extractor.py +52 -7
{debase-0.1.1.dist-info → debase-0.1.3.dist-info}/METADATA +2 -61
debase-0.1.3.dist-info/RECORD +17 -0
debase-0.1.1.dist-info/RECORD +0 -16
{debase-0.1.1.dist-info → debase-0.1.3.dist-info}/WHEEL +0 -0
{debase-0.1.1.dist-info → debase-0.1.3.dist-info}/entry_points.txt +0 -0
{debase-0.1.1.dist-info → debase-0.1.3.dist-info}/licenses/LICENSE +0 -0
{debase-0.1.1.dist-info → debase-0.1.3.dist-info}/top_level.txt +0 -0

debase/PIPELINE_FLOW.md ADDED

@@ -0,0 +1,100 @@
+# DEBase Pipeline Flow
+## Overview
+The DEBase pipeline extracts enzyme engineering data from chemistry papers through a series of modular steps.
+## Pipeline Architecture
+```
+┌─────────────────────┐     ┌─────────────────────┐
+│   Manuscript PDF    │     │       SI PDF        │
+└──────────┬──────────┘     └──────────┬──────────┘
+           │                           │
+           └───────────┬───────────────┘
+                       │
+                       ▼
+         ┌─────────────────────────────┐
+         │ 1. enzyme_lineage_extractor │
+         │   - Extract enzyme variants │
+         │   - Parse mutations         │
+         │   - Get basic metadata      │
+         └─────────────┬───────────────┘
+                       │
+                       ▼
+         ┌─────────────────────────────┐
+         │    2. cleanup_sequence      │
+         │   - Validate sequences      │
+         │   - Fix formatting issues   │
+         │   - Generate full sequences │
+         └─────────────┬───────────────┘
+                       │
+           ┌───────────┴───────────────┐
+           │                           │
+           ▼                           ▼
+┌─────────────────────────┐ ┌─────────────────────────┐
+│ 3a. reaction_info       │ │ 3b. substrate_scope     │
+│     _extractor          │ │     _extractor          │
+│ - Performance metrics   │ │ - Substrate variations  │
+│ - Model reaction        │ │ - Additional variants   │
+│ - Conditions            │ │ - Scope data            │
+└───────────┬─────────────┘ └───────────┬─────────────┘
+            │                           │
+            └───────────┬───────────────┘
+                        │
+                        ▼
+          ┌─────────────────────────────┐
+          │    4. lineage_format_o3     │
+          │   - Merge all data          │
+          │   - Fill missing sequences  │
+          │   - Format final output     │
+          └─────────────┬───────────────┘
+                        │
+                        ▼
+                ┌─────────────┐
+                │ Final CSV   │
+                └─────────────┘
+```
+## Module Details
+### 1. enzyme_lineage_extractor.py
+- **Input**: Manuscript PDF, SI PDF
+- **Output**: CSV with enzyme variants and mutations
+- **Function**: Extracts enzyme identifiers, mutation lists, and basic metadata
+### 2. cleanup_sequence.py
+- **Input**: Enzyme lineage CSV
+- **Output**: CSV with validated sequences
+- **Function**: Validates protein sequences, generates full sequences from mutations
+### 3a. reaction_info_extractor.py
+- **Input**: PDFs + cleaned enzyme CSV
+- **Output**: CSV with reaction performance data
+- **Function**: Extracts yield, TTN, selectivity, reaction conditions
+### 3b. substrate_scope_extractor.py
+- **Input**: PDFs + cleaned enzyme CSV
+- **Output**: CSV with substrate scope entries
+- **Function**: Extracts substrate variations tested with different enzymes
+### 4. lineage_format_o3.py
+- **Input**: Reaction CSV + Substrate scope CSV
+- **Output**: Final formatted CSV
+- **Function**: Merges data, fills missing sequences, applies consistent formatting
+## Key Features
+1. **Modular Design**: Each step can be run independently
+2. **Parallel Extraction**: Steps 3a and 3b run independently
+3. **Error Recovery**: Pipeline can resume from any step
+4. **Clean Interfaces**: Each module has well-defined inputs/outputs
+## Usage
+```bash
+# Full pipeline
+python -m debase.wrapper_clean manuscript.pdf --si si.pdf --output results.csv
+# With intermediate files kept for debugging
+python -m debase.wrapper_clean manuscript.pdf --si si.pdf --keep-intermediates
+```
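
The step order in the added PIPELINE_FLOW.md can be sketched as follows. This is a minimal illustration, assuming each step is a plain function that takes and returns a file name. Every function below is a hypothetical stand-in, not the debase API; only the step order and the note that steps 3a and 3b are independent of each other come from the file.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for the pipeline modules; each returns its output name.
def extract_lineage(manuscript, si):
    return "lineage.csv"

def cleanup_sequences(lineage_csv):
    return "cleaned.csv"

def extract_reaction_info(manuscript, si, cleaned_csv):
    return "reaction.csv"

def extract_substrate_scope(manuscript, si, cleaned_csv):
    return "scope.csv"

def format_output(reaction_csv, scope_csv):
    return f"final({reaction_csv}+{scope_csv})"

def run_flow(manuscript, si):
    lineage = extract_lineage(manuscript, si)   # step 1
    cleaned = cleanup_sequences(lineage)        # step 2
    # Steps 3a and 3b both read only the cleaned CSV, so they can overlap.
    with ThreadPoolExecutor(max_workers=2) as pool:
        fut_a = pool.submit(extract_reaction_info, manuscript, si, cleaned)
        fut_b = pool.submit(extract_substrate_scope, manuscript, si, cleaned)
        reaction, scope = fut_a.result(), fut_b.result()
    return format_output(reaction, scope)       # step 4

final = run_flow("manuscript.pdf", "si.pdf")
```

Because step 4 waits on both futures, the merge only starts once both extraction branches have finished, which matches the join drawn in the diagram.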

debase/_version.py CHANGED

@@ -1,3 +1,3 @@
 """Version information."""
-__version__ = "0.1.1"
+__version__ = "0.1.3"

debase/enzyme_lineage_extractor.py CHANGED

@@ -1297,6 +1297,8 @@ _SEQUENCE_SCHEMA_HINT = """
 _SEQ_LOC_PROMPT = """
 Find where FULL-LENGTH protein or DNA sequences are located in this document.
+PRIORITY: Protein/amino acid sequences are preferred over DNA sequences.
 Look for table of contents entries or section listings that mention sequences.
 Return a JSON array where each element has:
 - "section": the section heading or description
@@ -1305,6 +1307,7 @@ Return a JSON array where each element has:
 Focus on:
 - Table of contents or entries about "Sequence Information" or "Nucleotide and amino acid sequences"
 - Return the EXACT notation as shown.
+- Prioritize sections that mention "protein" or "amino acid" sequences
 Return [] if no sequence sections are found.
 Absolutely don't include nucleotides or primer sequences, it is better to return nothing then incomplete sequence, use your best judgement.
@@ -1465,10 +1468,16 @@ def validate_sequence_locations(text: str, locations: list, model, *, pdf_paths:
 # --- 7.3  Main extraction prompt ---------------------------------------------
 _SEQ_EXTRACTION_PROMPT = """
 Extract EVERY distinct enzyme-variant sequence you can find in the text.
+IMPORTANT: Prioritize amino acid (protein) sequences over DNA sequences:
+- If an amino acid sequence exists for a variant, extract ONLY the aa_seq (set dna_seq to null)
+- Only extract dna_seq if NO amino acid sequence is available for that variant
+- This reduces redundancy since protein sequences are usually more relevant
 For each variant return:
   * variant_id  - the label used in the paper (e.g. "R4-10")
   * aa_seq      - amino-acid sequence (uppercase), or null
-  * dna_seq     - DNA sequence (A/C/G/T), or null
+  * dna_seq     - DNA sequence (A/C/G/T), or null (ONLY if no aa_seq exists)
 Respond ONLY with **minified JSON** that matches the schema below.
 NO markdown, no code fences, no commentary.
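
The new prompt text asks the model to return `dna_seq` only when no `aa_seq` exists, but a prompt alone cannot guarantee that. One way to enforce the same rule on the parsed response is sketched below. The helper `prefer_protein` and the sample sequences are hypothetical and used for illustration only; the field names and the `R4-10` label come from the prompt in the diff.

```python
import json

def prefer_protein(records):
    """Return copies of the records with dna_seq nulled wherever aa_seq is present."""
    cleaned = []
    for rec in records:
        rec = dict(rec)  # shallow copy so the caller's data is not mutated
        if rec.get("aa_seq"):
            rec["dna_seq"] = None
        cleaned.append(rec)
    return cleaned

# Placeholder response; the sequences are made up for the example.
raw = (
    '[{"variant_id":"R4-10","aa_seq":"MKT","dna_seq":"ATGAAAACC"},'
    '{"variant_id":"WT","aa_seq":null,"dna_seq":"ATGGCC"}]'
)
result = prefer_protein(json.loads(raw))
```

The second record keeps its DNA sequence because it has no amino acid sequence, which is the fallback the prompt describes.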

debase/reaction_info_extractor.py CHANGED

@@ -685,7 +685,7 @@ Ignore locations that contain data for other campaigns.
             'confidence': 95
         }
-    def find_lineage_model_reaction(self, location: str, group_context: str) -> Dict[str, Any]:
+    def find_lineage_model_reaction(self, location: str, group_context: str, model_reaction_locations: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
         """Find the model reaction for a specific lineage group."""
         # Gather relevant text near this location
         page_text = self._page_with_reference(location) or ""
@@ -693,6 +693,7 @@ Ignore locations that contain data for other campaigns.
         # Also check manuscript introduction for model reaction info
         intro_text = "\n\n".join(self.ms_pages[:3]) if self.ms_pages else ""
+        # Build the prompt with location and context
         prompt = PROMPT_FIND_LINEAGE_MODEL_REACTION.format(
             location=location,
             group_context=group_context
@@ -700,6 +701,22 @@ Ignore locations that contain data for other campaigns.
         prompt += f"\n\nText near {location}:\n{page_text[:3000]}"
         prompt += f"\n\nManuscript introduction:\n{intro_text[:3000]}"
+        # If we have model reaction locations, include text from those locations too
+        if model_reaction_locations:
+            # Add text from model reaction location
+            if model_reaction_locations.get("model_reaction_location", {}).get("location"):
+                model_loc = model_reaction_locations["model_reaction_location"]["location"]
+                model_text = self._get_text_around_location(model_loc)
+                if model_text:
+                    prompt += f"\n\nText from {model_loc} (potential model reaction location):\n{model_text[:3000]}"
+            # Add text from conditions location (often contains reaction details)
+            if model_reaction_locations.get("conditions_location", {}).get("location"):
+                cond_loc = model_reaction_locations["conditions_location"]["location"]
+                cond_text = self._get_text_around_location(cond_loc)
+                if cond_text:
+                    prompt += f"\n\nText from {cond_loc} (reaction conditions):\n{cond_text[:3000]}"
         try:
             data = generate_json_with_retry(
                 self.model,
@@ -1038,7 +1055,20 @@ Different campaigns may use different model reactions.
         """Extract text around a given location identifier."""
         location_lower = location.lower()
-        # Search in all pages
+        # Handle compound locations like "Figure 2 caption and Section I"
+        # Extract the first figure/table/scheme reference
+        figure_match = re.search(r"(figure|scheme|table)\s*\d+", location_lower)
+        if figure_match:
+            primary_location = figure_match.group(0)
+            # Try to find this primary location first
+            for page_text in self.all_pages:
+                if primary_location in page_text.lower():
+                    idx = page_text.lower().index(primary_location)
+                    start = max(0, idx - 500)
+                    end = min(len(page_text), idx + 3000)
+                    return page_text[start:end]
+        # Search in all pages for exact match
         for page_text in self.all_pages:
             if location_lower in page_text.lower():
                 # Find the location and extract context around it
@@ -1790,8 +1820,16 @@ TEXT FROM MANUSCRIPT:
             if location.get('caption'):
                 location_context += f"\nCaption: {location['caption']}"
-            # Try to find model reaction for this specific lineage
-            location_model_reaction = self.find_lineage_model_reaction(location['location'], location_context)
+            # First find model reaction locations for this campaign/enzyme group
+            location_enzymes = df_location['enzyme'].unique().tolist()
+            model_reaction_locations = self.find_model_reaction_locations(location_enzymes)
+            # Try to find model reaction for this specific lineage, passing the locations
+            location_model_reaction = self.find_lineage_model_reaction(
+                location['location'],
+                location_context,
+                model_reaction_locations
+            )
             # Get full model reaction info with IUPAC names
             if location_model_reaction.get('substrate_ids') or location_model_reaction.get('product_ids'):
@@ -1799,7 +1837,6 @@ TEXT FROM MANUSCRIPT:
             else:
                 # Fall back to general model reaction extraction
                 # Pass the enzyme variants from this location
-                location_enzymes = df_location['enzyme'].unique().tolist()
                 model_info = self.gather_model_reaction_info(location_enzymes)
             # Add model reaction info to all enzymes from this location
@@ -1891,7 +1928,16 @@ TEXT FROM MANUSCRIPT:
             if group.get('caption'):
                 location_context += f"\nCaption: {group['caption']}"
-            location_model_reaction = self.find_lineage_model_reaction(group_location, location_context)
+            # First find model reaction locations for this enzyme group
+            location_enzymes = df_location['enzyme'].unique().tolist() if 'enzyme' in df_location.columns else all_enzyme_ids
+            model_reaction_locations = self.find_model_reaction_locations(location_enzymes)
+            # Try to find model reaction for this specific lineage, passing the locations
+            location_model_reaction = self.find_lineage_model_reaction(
+                group_location,
+                location_context,
+                model_reaction_locations
+            )
             # Get full model reaction info with IUPAC names
             if location_model_reaction.get('substrate_ids') or location_model_reaction.get('product_ids'):
@@ -1899,7 +1945,6 @@ TEXT FROM MANUSCRIPT:
             else:
                 # Try to extract model reaction from this specific location
                 # Pass the enzyme variants that have data in this location
-                location_enzymes = df_location['enzyme'].unique().tolist() if 'enzyme' in df_location.columns else all_enzyme_ids
                 model_info = self.gather_model_reaction_info(location_enzymes)
             # Add model reaction info to all enzymes from this location
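
The compound-location handling added to `_get_text_around_location` can be tried outside the class with a standalone sketch. The function name `text_around_location`, its `pages` argument, and the empty-string return when nothing matches are assumptions made for this example; the regex and the 500/3000 character window are taken from the diff. The fallback branch reuses the same window here, which the truncated hunk does not confirm.

```python
import re

def text_around_location(location, pages, before=500, after=3000):
    """Return page text around the first figure/scheme/table reference, else around the full string."""
    location_lower = location.lower()
    # "Figure 2 caption and Section I" -> try "figure 2" first
    match = re.search(r"(figure|scheme|table)\s*\d+", location_lower)
    needles = [match.group(0)] if match else []
    needles.append(location_lower)  # fall back to the exact location string
    for needle in needles:
        for page in pages:
            idx = page.lower().find(needle)
            if idx != -1:
                return page[max(0, idx - before): idx + after]
    return ""  # assumed behaviour when nothing matches

pages = ["Intro text.", "See Figure 2 for the model reaction of 1a to 2a."]
hit = text_around_location("Figure 2 caption and Section I", pages)
miss = text_around_location("Table 9", pages)
loose = text_around_location("Figure 2", ["Figure 20 shows yields."])
```

Since the lookup is a plain substring test, `figure 2` also matches a page that only mentions `Figure 20`, as the last call shows; a word-boundary pattern would be one way to tighten this if it turns out to matter.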

{debase-0.1.1.dist-info → debase-0.1.3.dist-info}/METADATA RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: debase
-Version: 0.1.1
+Version: 0.1.3
 Summary: Enzyme lineage analysis and sequence extraction package
 Home-page: https://github.com/YuemingLong/DEBase
 Author: DEBase Team
@@ -64,13 +64,6 @@ Enzyme lineage analysis and sequence extraction package with advanced parallel p
 ```bash
 pip install debase
 ```
-For full functionality with chemical SMILES support:
-```bash
-pip install debase[rdkit]
-```
 ## Requirements
 - Python 3.8 or higher
@@ -139,13 +132,6 @@ debase --manuscript paper.pdf --si si.pdf --use-optimized-reaction --reaction-ba
 debase --manuscript paper.pdf --si si.pdf  # Default method
 ```
-## Performance Comparison
-| Method | Total Time | API Calls | Accuracy | Best For |
-|--------|------------|-----------|----------|----------|
-| Sequential | ~45 min | 44 calls | Highest | Small datasets |
-| **Parallel Individual** | **~12 min** | **44 calls** | **High** | **Recommended** |
-| Batch Processing | ~8 min | ~8 calls | Good | Speed-critical |
 ## Advanced Usage
@@ -169,31 +155,6 @@ python -m debase.substrate_scope_extractor_parallel \
   --manuscript paper.pdf --si si.pdf --lineage-csv lineage.csv \
   --max-workers 5 --output substrate_scope.csv
 ```
-## Python API
-```python
-from debase.wrapper import run_pipeline
-# Run full pipeline with parallel processing
-run_pipeline(
-    manuscript_path="paper.pdf",
-    si_path="si.pdf",
-    output="output.csv",
-    use_parallel_individual=True,
-    max_workers=5
-)
-# For individual steps
-from debase.reaction_info_extractor_parallel import extract_reaction_info_parallel
-from debase.enzyme_lineage_extractor import setup_gemini_api
-model = setup_gemini_api()
-reaction_data = extract_reaction_info_parallel(
-    model, manuscript_path, si_path, enzyme_csv_path, max_workers=5
-)
-```
 ## Pipeline Architecture
 The DEBase pipeline consists of 5 main steps:
@@ -222,9 +183,6 @@ The DEBase pipeline consists of 5 main steps:
 - **External database integration:** Automatic sequence fetching from PDB and UniProt
 - **AI-powered matching:** Uses Gemini to intelligently match database entries to enzyme variants
 - **Smart filtering:** Automatically excludes non-enzyme entries (buffers, controls, etc.)
-- **Progress tracking:** Real-time status updates
-- **Flexible output:** CSV format with comprehensive chemical and performance data
-- **Caching:** PDF encoding cache for improved performance
 - **Vision capabilities:** Extracts data from both text and images in PDFs
 ## Complete Command Reference
@@ -234,7 +192,6 @@ The DEBase pipeline consists of 5 main steps:
 --manuscript PATH           # Required: Path to manuscript PDF
 --si PATH                  # Optional: Path to supplementary information PDF
 --output PATH              # Output file path (default: manuscript_name_debase.csv)
---queries N                # Number of consensus queries (default: 2)
 ```
 ### Performance Options
@@ -279,21 +236,5 @@ The DEBase pipeline consists of 5 main steps:
 3. **Use batch processing** only when speed is critical and some accuracy loss is acceptable
 4. **Skip validation** (`--skip-validation`) for faster processing in production
 5. **Keep intermediates** (`--keep-intermediates`) for debugging and incremental runs
-6. **Check external databases** - Many sequences can be automatically fetched from PDB/UniProt
-7. **Verify enzyme entries** - The system automatically filters out buffers and controls
-## Troubleshooting
-### No sequences found
-- The extractor will automatically search PDB and UniProt databases
-- Check the logs for which database IDs were found and attempted
-- Sequences with PDB structures will be fetched with high confidence
-### Incorrect enzyme extraction
-- Non-enzyme entries (buffers, controls, media) are automatically filtered
-- Check the log for entries marked as "Filtering out non-enzyme entry"
+6.
-### PDB matching issues
-- The system uses AI to match PDB IDs to specific enzyme variants
-- Increased context extraction ensures better matching accuracy
-- Check logs for "Gemini PDB matching" entries to see the matching process

debase-0.1.3.dist-info/RECORD ADDED

@@ -0,0 +1,17 @@
+debase/PIPELINE_FLOW.md,sha256=S4nQyZlX39-Bchw1gQWPK60sHiFpB1eWHqo5GR9oTY8,4741
+debase/__init__.py,sha256=YeKveGj_8fwuu5ozoK2mUU86so_FjiCwsvg1d_lYVZU,586
+debase/__main__.py,sha256=LbxYt2x9TG5Ced7LpzzX_8gkWyXeZSlVHzqHfqAiPwQ,160
+debase/_version.py,sha256=92QgGO0ZoG0AhULGdcMTX2RSEJkv8UZrDw2peYQOh4U,49
+debase/build_db.py,sha256=bW574GxsL1BJtDwM19urLbciPcejLzfraXZPpzm09FQ,7167
+debase/cleanup_sequence.py,sha256=QyhUqvTBVFTGM7ebAHmP3tif3Jq-8hvoLApYwAJtpH4,32702
+debase/enzyme_lineage_extractor.py,sha256=sJ9Lz7Usse5NqdoZatoOEDMwbMYEgNH1HCLIGS9avn8,87774
+debase/lineage_format.py,sha256=mACni9M1RXA_1tIyDZJpStQoutd_HLG2qQMAORTusZs,30045
+debase/reaction_info_extractor.py,sha256=6wWj4IyUNSugNjxpwMGjABSAp68yHABaz_7ZRjh9GEk,112162
+debase/substrate_scope_extractor.py,sha256=dbve8q3K7ggA3A6EwB-KK9L19BnMNgPZMZ05G937dSY,82262
+debase/wrapper.py,sha256=lTx375a57EVuXcZ_roXaj5UDj8HjRcb5ViNaSgPN4Ik,10352
+debase-0.1.3.dist-info/licenses/LICENSE,sha256=5sk9_tcNmr1r2iMIUAiioBo7wo38u8BrPlO7f0seqgE,1075
+debase-0.1.3.dist-info/METADATA,sha256=WUJha43ZPKgGDNZD1DYu8CfJwxUOj09kzuPSqfwe96s,9382
+debase-0.1.3.dist-info/WHEEL,sha256=_zCd3N1l69ArxyTb8rzEoP9TpbYXkqRFSNOD5OuxnTs,91
+debase-0.1.3.dist-info/entry_points.txt,sha256=hUcxA1b4xORu-HHBFTe9u2KTdbxPzt0dwz95_6JNe9M,48
+debase-0.1.3.dist-info/top_level.txt,sha256=2BUeq-4kmQr0Rhl06AnRzmmZNs8WzBRK9OcJehkcdk8,7
+debase-0.1.3.dist-info/RECORD,,
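
Each RECORD line has the form `path,sha256=<digest>,<size>`, where the digest is the URL-safe base64 encoding of the SHA-256 hash with trailing `=` padding removed. The sketch below shows how such a line could be recomputed to check an unpacked file; `record_entry` is a hypothetical helper and the 49-byte payload is a placeholder, not the real contents of `_version.py`.

```python
import base64
import hashlib

def record_entry(path, data):
    """Build a RECORD-style line for the given path and file bytes."""
    digest = hashlib.sha256(data).digest()
    # URL-safe base64 with the "=" padding stripped, as used in wheel RECORD files
    encoded = base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")
    return f"{path},sha256={encoded},{len(data)}"

placeholder = b"x" * 49  # stand-in bytes; only the length mirrors the listing
entry = record_entry("debase/_version.py", placeholder)
empty_entry = record_entry("empty.txt", b"")
```

Comparing the recomputed line with the one in RECORD is a quick way to confirm that a file changed between versions, as the differing `_version.py` hashes in the two RECORD listings indicate.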

debase-0.1.1.dist-info/RECORD DELETED

@@ -1,16 +0,0 @@
-debase/__init__.py,sha256=YeKveGj_8fwuu5ozoK2mUU86so_FjiCwsvg1d_lYVZU,586
-debase/__main__.py,sha256=LbxYt2x9TG5Ced7LpzzX_8gkWyXeZSlVHzqHfqAiPwQ,160
-debase/_version.py,sha256=f_aADPF4S4TQJIdnkbAgxIqnOWgZS6TJ3X9EDBZt_OM,49
-debase/build_db.py,sha256=bW574GxsL1BJtDwM19urLbciPcejLzfraXZPpzm09FQ,7167
-debase/cleanup_sequence.py,sha256=QyhUqvTBVFTGM7ebAHmP3tif3Jq-8hvoLApYwAJtpH4,32702
-debase/enzyme_lineage_extractor.py,sha256=1GcgHA-lQPRf9-bNDlvQIP8p-KsP3D2WhIuOtCVJ_ME,87276
-debase/lineage_format.py,sha256=mACni9M1RXA_1tIyDZJpStQoutd_HLG2qQMAORTusZs,30045
-debase/reaction_info_extractor.py,sha256=euw-4NHFuOPxpF99PJxTMLYYG0WryBDUCpoANB-SPPM,109655
-debase/substrate_scope_extractor.py,sha256=dbve8q3K7ggA3A6EwB-KK9L19BnMNgPZMZ05G937dSY,82262
-debase/wrapper.py,sha256=lTx375a57EVuXcZ_roXaj5UDj8HjRcb5ViNaSgPN4Ik,10352
-debase-0.1.1.dist-info/licenses/LICENSE,sha256=5sk9_tcNmr1r2iMIUAiioBo7wo38u8BrPlO7f0seqgE,1075
-debase-0.1.1.dist-info/METADATA,sha256=GI8WvSNVIllw_ZKLqlhy-rqtVHBun3ZG1hahEvO_BMo,11509
-debase-0.1.1.dist-info/WHEEL,sha256=_zCd3N1l69ArxyTb8rzEoP9TpbYXkqRFSNOD5OuxnTs,91
-debase-0.1.1.dist-info/entry_points.txt,sha256=hUcxA1b4xORu-HHBFTe9u2KTdbxPzt0dwz95_6JNe9M,48
-debase-0.1.1.dist-info/top_level.txt,sha256=2BUeq-4kmQr0Rhl06AnRzmmZNs8WzBRK9OcJehkcdk8,7
-debase-0.1.1.dist-info/RECORD,,

{debase-0.1.1.dist-info → debase-0.1.3.dist-info}/WHEEL RENAMED

File without changes

{debase-0.1.1.dist-info → debase-0.1.3.dist-info}/entry_points.txt RENAMED

File without changes

{debase-0.1.1.dist-info → debase-0.1.3.dist-info}/licenses/LICENSE RENAMED

File without changes

{debase-0.1.1.dist-info → debase-0.1.3.dist-info}/top_level.txt RENAMED

File without changes
