debase 0.1.16__py3-none-any.whl → 0.1.18__py3-none-any.whl
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- debase/PIPELINE_FLOW.md +100 -0
- debase/_version.py +1 -1
- debase/enzyme_lineage_extractor.py +251 -13
- debase/lineage_format.py +113 -11
- debase/reaction_info_extractor.py +21 -6
- debase/wrapper.py +301 -67
- {debase-0.1.16.dist-info → debase-0.1.18.dist-info}/METADATA +1 -1
- debase-0.1.18.dist-info/RECORD +17 -0
- debase-0.1.16.dist-info/RECORD +0 -16
- {debase-0.1.16.dist-info → debase-0.1.18.dist-info}/WHEEL +0 -0
- {debase-0.1.16.dist-info → debase-0.1.18.dist-info}/entry_points.txt +0 -0
- {debase-0.1.16.dist-info → debase-0.1.18.dist-info}/licenses/LICENSE +0 -0
- {debase-0.1.16.dist-info → debase-0.1.18.dist-info}/top_level.txt +0 -0
debase/PIPELINE_FLOW.md
ADDED
@@ -0,0 +1,100 @@
+# DEBase Pipeline Flow
+
+## Overview
+The DEBase pipeline extracts enzyme engineering data from chemistry papers through a series of modular steps.
+
+## Pipeline Architecture
+
+```
+┌─────────────────────┐         ┌─────────────────────┐
+│   Manuscript PDF    │         │       SI PDF        │
+└──────────┬──────────┘         └──────────┬──────────┘
+           │                               │
+           └───────────────┬───────────────┘
+                           │
+                           ▼
+            ┌─────────────────────────────┐
+            │ 1. enzyme_lineage_extractor │
+            │  - Extract enzyme variants  │
+            │  - Parse mutations          │
+            │  - Get basic metadata       │
+            └─────────────┬───────────────┘
+                          │
+                          ▼
+            ┌─────────────────────────────┐
+            │     2. cleanup_sequence     │
+            │  - Validate sequences       │
+            │  - Fix formatting issues    │
+            │  - Generate full sequences  │
+            └─────────────┬───────────────┘
+                          │
+              ┌───────────┴───────────────┐
+              │                           │
+              ▼                           ▼
+┌─────────────────────────┐ ┌─────────────────────────┐
+│ 3a. reaction_info       │ │ 3b. substrate_scope     │
+│     _extractor          │ │     _extractor          │
+│ - Performance metrics   │ │ - Substrate variations  │
+│ - Model reaction        │ │ - Additional variants   │
+│ - Conditions            │ │ - Scope data            │
+└───────────┬─────────────┘ └───────────┬─────────────┘
+            │                           │
+            └───────────┬───────────────┘
+                        │
+                        ▼
+            ┌─────────────────────────────┐
+            │    4. lineage_format_o3     │
+            │  - Merge all data           │
+            │  - Fill missing sequences   │
+            │  - Format final output      │
+            └─────────────┬───────────────┘
+                          │
+                          ▼
+                   ┌─────────────┐
+                   │  Final CSV  │
+                   └─────────────┘
+```
+
+## Module Details
+
+### 1. enzyme_lineage_extractor.py
+- **Input**: Manuscript PDF, SI PDF
+- **Output**: CSV with enzyme variants and mutations
+- **Function**: Extracts enzyme identifiers, mutation lists, and basic metadata
+
+### 2. cleanup_sequence.py
+- **Input**: Enzyme lineage CSV
+- **Output**: CSV with validated sequences
+- **Function**: Validates protein sequences, generates full sequences from mutations
+
+### 3a. reaction_info_extractor.py
+- **Input**: PDFs + cleaned enzyme CSV
+- **Output**: CSV with reaction performance data
+- **Function**: Extracts yield, TTN, selectivity, reaction conditions
+
+### 3b. substrate_scope_extractor.py
+- **Input**: PDFs + cleaned enzyme CSV
+- **Output**: CSV with substrate scope entries
+- **Function**: Extracts substrate variations tested with different enzymes
+
+### 4. lineage_format_o3.py
+- **Input**: Reaction CSV + Substrate scope CSV
+- **Output**: Final formatted CSV
+- **Function**: Merges data, fills missing sequences, applies consistent formatting
+
+## Key Features
+
+1. **Modular Design**: Each step can be run independently
+2. **Parallel Extraction**: Steps 3a and 3b run independently
+3. **Error Recovery**: Pipeline can resume from any step
+4. **Clean Interfaces**: Each module has well-defined inputs/outputs
+
+## Usage
+
+```bash
+# Full pipeline
+python -m debase.wrapper_clean manuscript.pdf --si si.pdf --output results.csv
+
+# With intermediate files kept for debugging
+python -m debase.wrapper_clean manuscript.pdf --si si.pdf --keep-intermediates
+```
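The "modular design" and "resume from any step" features above map directly onto the per-step entry points added in the wrapper.py diff further down. A minimal sketch of resuming at step 3a, using only functions visible in that diff (file paths are illustrative):

```python
from pathlib import Path

from debase.wrapper import run_reaction_extraction, run_lineage_format

# Steps 1-2 from an earlier run already produced a cleaned lineage CSV
cleaned_csv = Path("out/enzyme_lineage_cleaned.csv")

# Redo step 3a only, then re-merge everything (step 4)
reaction_csv = run_reaction_extraction(
    manuscript=Path("manuscript.pdf"),
    si=Path("si.pdf"),
    lineage_csv=cleaned_csv,
    output=Path("out/reaction_info.csv"),
)
final_csv = run_lineage_format(
    reaction_csv,
    Path("out/substrate_scope.csv"),  # reuse step 3b output from the earlier run
    cleaned_csv,
    Path("out/final.csv"),
)
print(f"Final output: {final_csv}")
```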
debase/enzyme_lineage_extractor.py
CHANGED
@@ -823,7 +823,14 @@ def identify_evolution_locations(
 
 def _parse_variants(data: Dict[str, Any], campaign_id: Optional[str] = None) -> List[Variant]:
     """Convert raw JSON to a list[Variant] with basic validation."""
-
+    if isinstance(data, list):
+        # Direct array of variants
+        variants_json = data
+    elif isinstance(data, dict):
+        # Object with "variants" key
+        variants_json = data.get("variants", [])
+    else:
+        variants_json = []
     parsed: List[Variant] = []
     for item in variants_json:
         try:
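The practical effect of the new preamble is that a model response parses whether it arrives as a bare JSON array or as an object wrapping one. A standalone illustration of the same branching (a sketch, not the module's code):

```python
import json

def normalize(data):
    # Same tolerance as the new _parse_variants preamble
    if isinstance(data, list):
        return data
    if isinstance(data, dict):
        return data.get("variants", [])
    return []

direct_array = json.loads('[{"variant_id": "WT"}, {"variant_id": "M1"}]')
wrapped_object = json.loads('{"variants": [{"variant_id": "WT"}, {"variant_id": "M1"}]}')
assert normalize(direct_array) == normalize(wrapped_object)
```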
@@ -1283,13 +1290,40 @@ def get_lineage(
 
         log.info(f"Identified {len(campaigns)} distinct campaigns")
         for camp in campaigns:
             log.info(f"  - {camp.campaign_name}: {camp.description}")
+    else:
+        log.warning("No campaigns identified, creating default campaign for enzyme characterization")
+        # Create a default campaign when none are found
+        default_campaign = Campaign(
+            campaign_id="default_characterization",
+            campaign_name="Enzyme Characterization Study",
+            description="Default campaign for papers that characterize existing enzyme variants without describing new directed evolution",
+            model_substrate="Unknown",
+            model_product="Unknown",
+            data_locations=["Full manuscript text"]
+        )
+        campaigns = [default_campaign]
+        log.info(f"Created default campaign: {default_campaign.campaign_name}")
 
     # Use captions for identification - they're concise and focused
     locations = identify_evolution_locations(caption_text, model, debug_dir=debug_dir, campaigns=None, pdf_paths=pdf_paths)
 
     all_variants = []
 
-    if
+    if campaigns:
+        # If we have campaigns but no specific locations, use general extraction
+        if not locations:
+            log.info("No specific lineage locations found, extracting from full text with campaign context")
+            # Extract lineage for each campaign using full text
+            for campaign in campaigns:
+                log.info(f"Processing campaign: {campaign.campaign_id}")
+                campaign_variants = extract_campaign_lineage(
+                    full_text, model, campaign_id=campaign.campaign_id,
+                    debug_dir=debug_dir, pdf_paths=pdf_paths,
+                    campaign_info=campaign
+                )
+                all_variants.extend(campaign_variants)
+            return all_variants, campaigns
+        # Original logic for when we have both locations and campaigns
     # Log location information
     location_summary = []
     for loc in locations[:5]:
@@ -1939,6 +1973,173 @@ def fetch_pdb_sequences(pdb_id: str) -> Dict[str, str]:
 
         log.warning(f"Failed to fetch PDB {pdb_id}: {e}")
         return {}
 
+def extract_enzyme_info_with_gemini(
+    text: str,
+    variants: List[Variant],
+    model,
+) -> Dict[str, str]:
+    """Use Gemini to extract enzyme names or sequences when PDB IDs are not available.
+
+    Returns:
+        Dict mapping variant IDs to sequences
+    """
+    # Build variant info for context
+    variant_info = []
+    for v in variants[:10]:  # Limit to first 10 variants for context
+        info = {
+            "id": v.variant_id,
+            "mutations": v.mutations[:5] if v.mutations else [],  # Limit mutations shown
+            "parent": v.parent_id,
+            "generation": v.generation
+        }
+        variant_info.append(info)
+
+    prompt = f"""You are analyzing a scientific paper about enzyme engineering. No PDB IDs were found in the paper, and I need to obtain protein sequences for the enzyme variants described.
+
+Here are the variants found in the paper:
+{json.dumps(variant_info, indent=2)}
+
+Please analyze the paper text and:
+1. Identify the common name of the enzyme being studied (e.g., "P450 BM3", "cytochrome P450 BM3", "CYP102A1")
+2. If possible, extract or find the wild-type sequence
+3. Provide any UniProt IDs or accession numbers mentioned
+
+Paper text (first 5000 characters):
+{text[:5000]}
+
+Return your response as a JSON object with this structure:
+{{
+    "enzyme_name": "common name of the enzyme",
+    "systematic_name": "systematic name if applicable (e.g., CYP102A1)",
+    "uniprot_id": "UniProt ID if found",
+    "wild_type_sequence": "sequence if found in paper or if you know it",
+    "additional_names": ["list", "of", "alternative", "names"]
+}}
+
+If you cannot determine certain fields, set them to null.
+"""
+
+    try:
+        response = model.generate_content(prompt)
+        text_response = _extract_text(response).strip()
+
+        # Parse JSON response
+        if text_response.startswith("```"):
+            text_response = text_response.split("```")[1].strip()
+            if text_response.startswith("json"):
+                text_response = text_response[4:].strip()
+            text_response = text_response.split("```")[0].strip()
+
+        enzyme_info = json.loads(text_response)
+        log.info(f"Gemini extracted enzyme info: {enzyme_info.get('enzyme_name', 'Unknown')}")
+
+        sequences = {}
+
+        # If Gemini provided a sequence directly, use it
+        if enzyme_info.get("wild_type_sequence"):
+            # Clean the sequence
+            seq = enzyme_info["wild_type_sequence"].upper().replace(" ", "").replace("\n", "")
+            # Validate it looks like a protein sequence
+            if seq and all(c in "ACDEFGHIKLMNPQRSTVWY" for c in seq) and len(seq) > 50:
+                # Map to the first variant or wild-type
+                wt_variant = next((v for v in variants if "WT" in v.variant_id.upper() or v.generation == 0), None)
+                if wt_variant:
+                    sequences[wt_variant.variant_id] = seq
+                else:
+                    sequences[variants[0].variant_id] = seq
+                log.info(f"Using sequence from Gemini: {len(seq)} residues")
+
+        # If no sequence but we have names, try to fetch from UniProt
+        if not sequences:
+            names_to_try = []
+            if enzyme_info.get("enzyme_name"):
+                names_to_try.append(enzyme_info["enzyme_name"])
+            if enzyme_info.get("systematic_name"):
+                names_to_try.append(enzyme_info["systematic_name"])
+            if enzyme_info.get("uniprot_id"):
+                names_to_try.append(enzyme_info["uniprot_id"])
+            if enzyme_info.get("additional_names"):
+                names_to_try.extend(enzyme_info["additional_names"])
+
+            # Try each name with UniProt
+            for name in names_to_try:
+                if name:
+                    uniprot_seqs = fetch_sequence_by_name(name)
+                    if uniprot_seqs:
+                        # Map the first sequence to appropriate variant
+                        seq = list(uniprot_seqs.values())[0]
+                        wt_variant = next((v for v in variants if "WT" in v.variant_id.upper() or v.generation == 0), None)
+                        if wt_variant:
+                            sequences[wt_variant.variant_id] = seq
+                        else:
+                            sequences[variants[0].variant_id] = seq
+                        log.info(f"Found sequence via UniProt search for '{name}': {len(seq)} residues")
+                        break
+
+        return sequences
+
+    except Exception as e:
+        log.warning(f"Failed to extract enzyme info with Gemini: {e}")
+        return {}
+
+
+def fetch_sequence_by_name(enzyme_name: str) -> Dict[str, str]:
+    """Fetch protein sequences from UniProt by enzyme name or ID.
+
+    Args:
+        enzyme_name: Name, ID, or accession of the enzyme
+
+    Returns:
+        Dict mapping identifiers to sequences
+    """
+    import requests
+
+    clean_name = enzyme_name.strip()
+
+    # First try as accession number
+    if len(clean_name) <= 10 and (clean_name[0].isalpha() and clean_name[1:].replace("_", "").isalnum()):
+        # Looks like a UniProt accession
+        url = f"https://rest.uniprot.org/uniprotkb/{clean_name}"
+        try:
+            response = requests.get(url, timeout=10)
+            if response.status_code == 200:
+                data = response.json()
+                sequence = data.get('sequence', {}).get('value', '')
+                if sequence:
+                    return {clean_name: sequence}
+        except:
+            pass
+
+    # Try search API
+    url = "https://rest.uniprot.org/uniprotkb/search"
+    params = {
+        "query": f'(protein_name:"{clean_name}" OR gene:"{clean_name}" OR id:"{clean_name}")',
+        "format": "json",
+        "size": "5",
+        "fields": "accession,id,protein_name,gene_names,sequence"
+    }
+
+    try:
+        response = requests.get(url, params=params, timeout=10)
+        response.raise_for_status()
+        data = response.json()
+
+        results = data.get('results', [])
+        sequences = {}
+
+        for result in results[:1]:  # Just take the first match
+            sequence = result.get('sequence', {}).get('value', '')
+            if sequence:
+                sequences[clean_name] = sequence
+                break
+
+        return sequences
+
+    except Exception as e:
+        log.warning(f"Failed to fetch sequence for '{enzyme_name}': {e}")
+        return {}
+
+
 def match_pdb_to_variants(
     pdb_sequences: Dict[str, str],
     variants: List[Variant],
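The fence-stripping dance above (`text_response.split("```")…`) reappears, slightly varied, in the wrapper.py diff below. A sketch of a shared helper both call sites could delegate to (hypothetical name, not part of the package):

```python
def strip_code_fences(text: str) -> str:
    """Return the payload of a ```-fenced LLM response, or the text unchanged."""
    text = text.strip()
    if not text.startswith("```"):
        return text
    body = text.split("```")[1]
    if body.startswith("json"):
        body = body[4:]  # drop the language tag
    return body.strip()

assert strip_code_fences('```json\n{"a": 1}\n```') == '{"a": 1}'
assert strip_code_fences('{"a": 1}') == '{"a": 1}'
```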
@@ -2110,16 +2311,23 @@ def _merge_lineage_and_sequences(
 
         for v in lineage
     ])
 
-
-
-
-
-
-
-
-
-
-
+    if seqs:
+        df_seq = pd.DataFrame([
+            {
+                "variant_id": s.variant_id,
+                "aa_seq": s.aa_seq,
+                "dna_seq": s.dna_seq,
+                "seq_confidence": s.confidence,
+                "truncated": s.truncated,
+                "seq_source": s.metadata.get("source", None) if s.metadata else None,
+            }
+            for s in seqs
+        ])
+    else:
+        # Create empty DataFrame with correct columns for merging
+        df_seq = pd.DataFrame(columns=[
+            "variant_id", "aa_seq", "dna_seq", "seq_confidence", "truncated", "seq_source"
+        ])
 
     # Log sequence data info
     if len(df_seq) > 0:
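Why the else branch matters: merging against a bare empty DataFrame raises a KeyError on the missing join column, while an empty frame with the right schema just yields NaN-filled columns. A quick illustration (assuming pandas):

```python
import pandas as pd

df_lineage = pd.DataFrame({"variant_id": ["WT", "M1"]})

# Empty but schema-correct, as in the new else branch: merge succeeds with NaNs
df_seq = pd.DataFrame(columns=["variant_id", "aa_seq", "dna_seq",
                               "seq_confidence", "truncated", "seq_source"])
merged = df_lineage.merge(df_seq, on="variant_id", how="left")
print(merged["aa_seq"].isna().all())  # True

# A bare pd.DataFrame() has no variant_id column, so the same merge raises KeyError
try:
    df_lineage.merge(pd.DataFrame(), on="variant_id", how="left")
except KeyError as e:
    print("merge failed:", e)
```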
@@ -2397,7 +2605,7 @@ def run_pipeline(
 
     early_df = _lineage_to_dataframe(lineage)
     output_csv_path = Path(output_csv)
     # Save lineage-only data with specific filename
-    lineage_path = output_csv_path.parent / "
+    lineage_path = output_csv_path.parent / "enzyme_lineage_name.csv"
     early_df.to_csv(lineage_path, index=False)
     log.info(
         "Saved lineage-only CSV -> %s",
@@ -2461,6 +2669,36 @@ def run_pipeline(
 
                 log.warning(f"No sequences found in PDB {pdb_id}")
         else:
             log.warning("No PDB IDs found in paper")
+
+        # 4b. If still no sequences, try Gemini extraction as last resort
+        if not sequences or all(not s.aa_seq for s in sequences):
+            log.info("No sequences from PDB, attempting Gemini-based extraction...")
+
+            gemini_sequences = extract_enzyme_info_with_gemini(full_text, lineage, model)
+
+            if gemini_sequences:
+                # Convert to SequenceBlock objects
+                gemini_seq_blocks = []
+                for variant_id, seq in gemini_sequences.items():
+                    # Find the matching variant
+                    variant = next((v for v in lineage if v.variant_id == variant_id), None)
+                    if variant:
+                        seq_block = SequenceBlock(
+                            variant_id=variant.variant_id,
+                            aa_seq=seq,
+                            dna_seq=None,
+                            confidence=0.9,  # High confidence but slightly lower than PDB
+                            truncated=False,
+                            metadata={"source": "Gemini/UniProt"}
+                        )
+                        gemini_seq_blocks.append(seq_block)
+                        log.info(f"Added sequence for {variant.variant_id} via Gemini/UniProt: {len(seq)} residues")
+
+                if gemini_seq_blocks:
+                    sequences = gemini_seq_blocks
+                    log.info(f"Successfully extracted {len(gemini_seq_blocks)} sequences via Gemini")
+                else:
+                    log.warning("Failed to extract sequences via Gemini")
 
     # 5. Merge & score (Section 8) --------------------------------------------
     doi = extract_doi(manuscript)
debase/lineage_format.py
CHANGED
@@ -188,11 +188,17 @@ class VariantRecord:
 
     # Reaction-related -------------------------------------------------------------
     def substrate_iupac(self) -> List[str]:
         raw = str(self.row.get("substrate_iupac_list", "")).strip()
-
+        result = _split_list(raw)
+        if not result and raw and raw.lower() != 'nan':
+            log.debug(f"substrate_iupac_list for {self.eid}: raw='{raw}', parsed={result}")
+        return result
 
     def product_iupac(self) -> List[str]:
         raw = str(self.row.get("product_iupac_list", "")).strip()
-
+        result = _split_list(raw)
+        if not result and raw and raw.lower() != 'nan':
+            log.debug(f"product_iupac_list for {self.eid}: raw='{raw}', parsed={result}")
+        return result
 
 
     def ttn_or_yield(self) -> Optional[float]:
@@ -377,6 +383,53 @@ def _nt_mut(parent_aa: str, child_aa: str, parent_nt: str = "", child_nt: str =
 
 
 # === 6. SMILES CONVERSION HELPERS ==================================================
 
+def search_smiles_with_gemini(compound_name: str, model=None) -> Optional[str]:
+    """
+    Use Gemini to search for SMILES strings of complex compounds.
+    Returns SMILES string if found, None otherwise.
+    """
+    if not compound_name or compound_name.lower() in ['nan', 'none', '']:
+        return None
+
+    if not model:
+        try:
+            # Import get_model from enzyme_lineage_extractor
+            import sys
+            from pathlib import Path
+            sys.path.append(str(Path(__file__).parent))
+            from enzyme_lineage_extractor import get_model
+            model = get_model()
+        except Exception as e:
+            log.warning(f"Could not load Gemini model: {e}")
+            return None
+
+    prompt = f"""Search for the SMILES string representation of this chemical compound:
+"{compound_name}"
+
+IMPORTANT:
+- Do NOT generate or create a SMILES string
+- Only provide SMILES that you can find in chemical databases or literature
+- For deuterated compounds, search for the specific isotope-labeled SMILES
+- If you cannot find the exact SMILES, say "NOT FOUND"
+
+Return ONLY the SMILES string if found, or "NOT FOUND" if not found.
+No explanation or additional text."""
+
+    try:
+        response = model.generate_content(prompt)
+        result = response.text.strip()
+
+        if result and result != "NOT FOUND" and not result.startswith("I"):
+            # Basic validation that it looks like SMILES
+            if any(c in result for c in ['C', 'c', 'N', 'O', 'S', 'P', '[', ']', '(', ')']):
+                log.info(f"Gemini found SMILES for '{compound_name}': {result}")
+                return result
+        return None
+    except Exception as e:
+        log.debug(f"Gemini SMILES search failed for '{compound_name}': {e}")
+        return None
+
+
 def _split_list(raw: str) -> List[str]:
     if not raw or str(raw).lower() == 'nan':
         return []
@@ -429,7 +482,12 @@ def _name_to_smiles(name: str, is_substrate: bool) -> str:
 
     except FileNotFoundError:
         pass  # OPSIN not installed
 
-    # 3.
+    # 3. Gemini search (for complex compounds) ---------------------------------
+    gemini_smiles = search_smiles_with_gemini(name)
+    if gemini_smiles:
+        return gemini_smiles
+
+    # 4. PubChem PUG REST (online) ---------------------------------------------
     try:
         import requests
 
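The body of step 4 is cut off in this hunk; for orientation, a minimal sketch of the kind of PubChem PUG REST lookup the comment names (standard public endpoint, not necessarily the module's exact request):

```python
import requests
from urllib.parse import quote

def pubchem_name_to_smiles(name: str, timeout: int = 10):
    # PUG REST: resolve a compound name to its isomeric SMILES
    url = (
        "https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/"
        f"{quote(name)}/property/IsomericSMILES/JSON"
    )
    resp = requests.get(url, timeout=timeout)
    if resp.status_code != 200:
        return None
    props = resp.json().get("PropertyTable", {}).get("Properties", [])
    return props[0].get("IsomericSMILES") if props else None

print(pubchem_name_to_smiles("ethanol"))  # CCO
```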
@@ -538,13 +596,23 @@ def _root_enzyme_id(eid: str, idmap: Dict[str, Dict[str, str]], lineage_roots: D
 
 
 def _generate_lineage_roots(df: pd.DataFrame) -> Dict[str, str]:
     """Infer lineage roots using generation numbers and simple sequence similarity."""
-
+    # Create idmap, handling missing enzyme_id gracefully
+    idmap: Dict[str, Dict[str, str]] = {}
+    for _, r in df.iterrows():
+        eid = r.get("enzyme_id")
+        if pd.isna(eid) or str(eid).strip() == "":
+            continue
+        idmap[str(eid)] = r
     roots: Dict[str, str] = {}
     # Look for generation 0 as the root
-    gen0 = {r["enzyme_id"] for _, r in df.iterrows()
+    gen0 = {r["enzyme_id"] for _, r in df.iterrows()
+            if str(r.get("generation", "")).strip() == "0"
+            and not pd.isna(r.get("enzyme_id"))}
     # If no gen0 found, fall back to gen1
     if not gen0:
-        gen0 = {r["enzyme_id"] for _, r in df.iterrows()
+        gen0 = {r["enzyme_id"] for _, r in df.iterrows()
+                if str(r.get("generation", "")).strip() == "1"
+                and not pd.isna(r.get("enzyme_id"))}
 
     def _seq_sim(a: str, b: str) -> float:
         if not a or not b:
@@ -553,7 +621,9 @@ def _generate_lineage_roots(df: pd.DataFrame) -> Dict[str, str]:
 
         return matches / max(len(a), len(b))
 
     for _, row in df.iterrows():
-        eid = row
+        eid = row.get("enzyme_id")
+        if pd.isna(eid) or str(eid).strip() == "":
+            continue
         if eid in gen0:
             roots[eid] = eid
             continue
@@ -593,6 +663,9 @@ def _generate_lineage_roots(df: pd.DataFrame) -> Dict[str, str]:
 
 
 def flatten_dataframe(df: pd.DataFrame) -> pd.DataFrame:
     """Main public API: returns a DataFrame in the flat output format."""
+    log.info(f"Starting flatten_dataframe with {len(df)} input rows")
+    log.info(f"Input columns: {list(df.columns)}")
+
     # Apply column aliases to the dataframe
     for alias, canonical in COLUMN_ALIASES.items():
         if alias in df.columns and canonical not in df.columns:
@@ -621,8 +694,29 @@ def flatten_dataframe(df: pd.DataFrame) -> pd.DataFrame:
 
     # _save_pickle(SUBSTRATE_CACHE, SUBSTRATE_CACHE_FILE)
 
     # 3. Flatten rows ---------------------------------------------------------
-
+    # Create idmap for parent lookups, but note this will only keep last occurrence of duplicates
+    idmap = {}
+    for _, r in df.iterrows():
+        eid = str(r["enzyme_id"])
+        if eid in idmap:
+            log.debug(f"Overwriting duplicate enzyme_id in idmap: {eid}")
+        idmap[eid] = r.to_dict()
+
+    # Check for duplicate enzyme_ids
+    enzyme_ids = [str(r["enzyme_id"]) for _, r in df.iterrows()]
+    unique_ids = set(enzyme_ids)
+    if len(enzyme_ids) != len(unique_ids):
+        log.warning(f"Found duplicate enzyme_ids! Total: {len(enzyme_ids)}, Unique: {len(unique_ids)}")
+        from collections import Counter
+        id_counts = Counter(enzyme_ids)
+        duplicates = {k: v for k, v in id_counts.items() if v > 1}
+        log.warning(f"Duplicate enzyme_ids: {duplicates}")
+        log.info("Note: All rows will still be processed, but parent lookups may use the last occurrence of duplicate IDs")
+
     output_rows: List[Dict[str, str]] = []
+    skipped_count = 0
+    processed_count = 0
+
     for idx, (_, row) in enumerate(df.iterrows()):
         rec = VariantRecord(row.to_dict())
         eid = rec.eid
@@ -632,13 +726,19 @@ def flatten_dataframe(df: pd.DataFrame) -> pd.DataFrame:
 
         prods = rec.product_iupac()
         data_type = rec.row.get("data_type", "")
 
-        if not
-        # Skip entries without
+        if not prods:
+            # Skip entries without product info unless it's marked as lineage only
             if data_type == "lineage":
                 subs, prods = [""], [""]  # placeholders
             else:
-                log.
+                log.info(f"Skipping enzyme_id={eid} (row {idx}) due to missing product data. prods={prods}, data_type={data_type}")
+                skipped_count += 1
                 continue
+
+        # If no substrates but we have products, use empty substrate list
+        if not subs:
+            log.debug(f"Empty substrate list for enzyme_id={eid}, using empty placeholder")
+            subs = [""]
 
         sub_smiles = [sub_cache.get(s, "") for s in subs]
         prod_smiles = [prod_cache.get(p, "") for p in prods]
@@ -712,7 +812,9 @@ def flatten_dataframe(df: pd.DataFrame) -> pd.DataFrame:
 
             additional_information=additional_information,
         )
         output_rows.append(flat.as_dict())
+        processed_count += 1
 
+    log.info(f"Flattening complete: {processed_count} rows processed, {skipped_count} rows skipped")
     out_df = pd.DataFrame(output_rows, columns=OUTPUT_COLUMNS)
     return out_df
 
debase/reaction_info_extractor.py
CHANGED
@@ -761,6 +761,15 @@ Ignore locations that contain data for other campaigns.
 
                 return line
         return page[:800]
 
+    def _ensure_rgb_pixmap(self, pix: fitz.Pixmap) -> fitz.Pixmap:
+        """Ensure pixmap is in RGB colorspace for PIL compatibility."""
+        if pix.alpha:  # RGBA -> RGB
+            pix = fitz.Pixmap(fitz.csRGB, pix)
+        elif pix.colorspace and pix.colorspace.name not in ["DeviceRGB", "DeviceGray"]:
+            # Convert unsupported colorspaces (CMYK, LAB, etc.) to RGB
+            pix = fitz.Pixmap(fitz.csRGB, pix)
+        return pix
+
     # ---- NEW: Page image helper for both figures and tables ----
     def _extract_page_png(self, ref: str, extract_figure_only: bool = True) -> Optional[str]:
         """Export the page containing the reference as PNG.
@@ -802,14 +811,14 @@ Ignore locations that contain data for other campaigns.
 
                 if img_rect.y1 < cap_rect.y0:  # fully above caption
                     # Extract image bytes
                     pix = fitz.Pixmap(doc, xref)
-
-                    pix = fitz.Pixmap(fitz.csRGB, pix)
+                    pix = self._ensure_rgb_pixmap(pix)
                     img_bytes = pix.tobytes("png")
                     return b64encode(img_bytes).decode()
         else:
             # Extract the entire page as an image
             mat = fitz.Matrix(2.0, 2.0)  # 2x zoom for better quality
             pix = page.get_pixmap(matrix=mat)
+            pix = self._ensure_rgb_pixmap(pix)
             img_bytes = pix.tobytes("png")
             return b64encode(img_bytes).decode()
         return None
@@ -842,11 +851,13 @@ Ignore locations that contain data for other campaigns.
 
             # Add the current page
             mat = fitz.Matrix(2.0, 2.0)  # 2x zoom for better quality
             pix = doc.load_page(page_num).get_pixmap(matrix=mat)
+            pix = self._ensure_rgb_pixmap(pix)
             all_images.append(pix)
 
             # If this is the last page with the reference, also add the next page
             if i == len(pages) - 1 and page_num + 1 < doc.page_count:
                 next_pix = doc.load_page(page_num + 1).get_pixmap(matrix=mat)
+                next_pix = self._ensure_rgb_pixmap(next_pix)
                 all_images.append(next_pix)
                 LOGGER.info(f"Added next page: page {page_num + 2}")  # +2 because page numbers are 1-based for users
 
@@ -855,14 +866,16 @@ Ignore locations that contain data for other campaigns.
 
 
         # If only one page, return it directly
         if len(all_images) == 1:
-
+            pix = self._ensure_rgb_pixmap(all_images[0])
+            return b64encode(pix.tobytes("png")).decode()
 
         # Combine multiple pages vertically
         if not all_images:
             return None
 
         if len(all_images) == 1:
-
+            pix = self._ensure_rgb_pixmap(all_images[0])
+            return b64encode(pix.tobytes("png")).decode()
 
         # Calculate dimensions for combined image
         total_height = sum(pix.height for pix in all_images)
@@ -903,6 +916,7 @@ Ignore locations that contain data for other campaigns.
 
         # Convert the page to a pixmap
         mat = fitz.Matrix(2.0, 2.0)  # 2x zoom for quality
         combined_pix = page.get_pixmap(matrix=mat)
+        combined_pix = self._ensure_rgb_pixmap(combined_pix)
 
         # Convert to PNG and return
         img_bytes = combined_pix.tobytes("png")
@@ -947,8 +961,9 @@ Ignore locations that contain data for other campaigns.
 
             LOGGER.info("Gemini Vision: extracting metrics for %d enzymes from %s…", len(enzyme_list), ref)
             tag = f"extract_metrics_batch_vision"
         else:
-            # Add enzyme names to prompt for batch extraction
-
+            # Add enzyme names to prompt for batch extraction with explicit format requirement
+            format_example = '{"enzyme1": {"yield": "99.0%", "ttn": null, ...}, "enzyme2": {"yield": "85.0%", ...}}'
+            prompt = campaign_context + PROMPT_EXTRACT_METRICS + f"\n\nExtract performance data for ALL these enzyme variants:\n{enzyme_names}\n\nReturn a JSON object with enzyme names as keys, each containing the metrics.\nExample format: {format_example}\n\n=== CONTEXT ===\n" + snippet[:4000]
             LOGGER.info("Gemini: extracting metrics for %d enzymes from %s…", len(enzyme_list), ref)
             tag = f"extract_metrics_batch"
 
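What `_ensure_rgb_pixmap` buys: Pillow can only interpret PyMuPDF's raw sample buffer when it is plain RGB or grayscale, so alpha or CMYK pixmaps must be converted before handing the bytes over. A sketch of the round-trip (assuming PyMuPDF and Pillow; the path is illustrative):

```python
import fitz  # PyMuPDF
from PIL import Image

doc = fitz.open("manuscript.pdf")
pix = doc.load_page(0).get_pixmap(matrix=fitz.Matrix(2.0, 2.0))

# Same normalization as _ensure_rgb_pixmap: strip alpha / exotic colorspaces
if pix.alpha or (pix.colorspace and pix.colorspace.name not in ("DeviceRGB", "DeviceGray")):
    pix = fitz.Pixmap(fitz.csRGB, pix)

# Pillow can now read the raw sample buffer directly
mode = "RGB" if pix.n == 3 else "L"
img = Image.frombytes(mode, (pix.width, pix.height), pix.samples)
img.save("page0.png")
```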
debase/wrapper.py
CHANGED
@@ -46,101 +46,333 @@ def run_sequence_cleanup(input_csv: Path, output_csv: Path) -> Path:
 
     """
     Step 2: Clean and validate protein sequences
     Calls: cleanup_sequence.py
+    Returns output path even if cleanup fails (copies input file)
     """
     logger.info(f"Cleaning sequences from {input_csv.name}")
 
-
-
-
-
-
+    try:
+        from .cleanup_sequence import main as cleanup_sequences
+        cleanup_sequences([str(input_csv), str(output_csv)])
+
+        logger.info(f"Sequence cleanup complete: {output_csv}")
+        return output_csv
+
+    except Exception as e:
+        logger.warning(f"Sequence cleanup failed: {e}")
+        logger.info("Copying original file to continue pipeline...")
+
+        # Copy the input file as-is to continue pipeline
+        import shutil
+        shutil.copy2(input_csv, output_csv)
+
+        logger.info(f"Original file copied: {output_csv}")
+        return output_csv
 
 
 def run_reaction_extraction(manuscript: Path, si: Path, lineage_csv: Path, output: Path, debug_dir: Path = None) -> Path:
     """
     Step 3a: Extract reaction performance metrics
     Calls: reaction_info_extractor.py
+    Returns output path even if extraction fails (creates empty file)
     """
     logger.info(f"Extracting reaction info for enzymes in {lineage_csv.name}")
 
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
+    try:
+        from .reaction_info_extractor import ReactionExtractor, Config
+        import pandas as pd
+
+        # Load enzyme data
+        enzyme_df = pd.read_csv(lineage_csv)
+
+        # Initialize extractor and run
+        cfg = Config()
+        extractor = ReactionExtractor(manuscript, si, cfg, debug_dir=debug_dir)
+        df_metrics = extractor.run(enzyme_df)
+
+        # Save results
+        df_metrics.to_csv(output, index=False)
+        logger.info(f"Reaction extraction complete: {output}")
+        return output
+
+    except Exception as e:
+        logger.warning(f"Reaction extraction failed: {e}")
+        logger.info("Creating empty reaction info file to continue pipeline...")
+
+        # Create empty reaction CSV with basic columns
+        import pandas as pd
+        empty_df = pd.DataFrame(columns=[
+            'enzyme', 'substrate', 'product', 'yield_percent', 'ee_percent',
+            'conversion_percent', 'reaction_type', 'reaction_conditions', 'notes'
+        ])
+        empty_df.to_csv(output, index=False)
+
+        logger.info(f"Empty reaction file created: {output}")
+        return output
 
 
 def run_substrate_scope_extraction(manuscript: Path, si: Path, lineage_csv: Path, output: Path, debug_dir: Path = None) -> Path:
     """
     Step 3b: Extract substrate scope data (runs in parallel with reaction extraction)
     Calls: substrate_scope_extractor.py
+    Returns output path even if extraction fails (creates empty file)
     """
     logger.info(f"Extracting substrate scope for enzymes in {lineage_csv.name}")
 
-
+    try:
+        from .substrate_scope_extractor import run_pipeline
+
+        # Run substrate scope extraction
+        run_pipeline(
+            manuscript=manuscript,
+            si=si,
+            lineage_csv=lineage_csv,
+            output_csv=output,
+            debug_dir=debug_dir
+        )
+
+        logger.info(f"Substrate scope extraction complete: {output}")
+        return output
+
+    except Exception as e:
+        logger.warning(f"Substrate scope extraction failed: {e}")
+        logger.info("Creating empty substrate scope file to continue pipeline...")
+
+        # Create empty substrate scope CSV with proper headers
+        import pandas as pd
+        empty_df = pd.DataFrame(columns=[
+            'enzyme', 'substrate', 'product', 'yield_percent', 'ee_percent',
+            'conversion_percent', 'selectivity', 'reaction_conditions', 'notes'
+        ])
+        empty_df.to_csv(output, index=False)
+
+        logger.info(f"Empty substrate scope file created: {output}")
+        return output
+
+
+def match_enzyme_variants_with_gemini(lineage_enzymes: list, data_enzymes: list, model=None) -> dict:
+    """
+    Use Gemini to match enzyme variant IDs between different datasets.
+    Returns a mapping of data_enzyme_id -> lineage_enzyme_id.
+    """
+    import json
 
-
-
-
-
-
-
-
-    )
+    if not model:
+        try:
+            from .enzyme_lineage_extractor import get_model
+            model = get_model()
+        except:
+            logger.warning("Could not load Gemini model for variant matching")
+            return {}
 
-
-
+    prompt = f"""Match enzyme variant IDs between two lists from the same scientific paper.
+
+These lists come from different sections or analyses of the same study, but may use different naming conventions.
+
+List 1 (from lineage/sequence data):
+{json.dumps(lineage_enzymes)}
+
+List 2 (from experimental data):
+{json.dumps(data_enzymes)}
+
+Analyze the patterns and match variants that refer to the same enzyme.
+Return ONLY a JSON object mapping IDs from List 2 to their corresponding IDs in List 1.
+Format: {{"list2_id": "list1_id", ...}}
+Only include matches you are confident about based on the naming patterns.
+"""
+
+    try:
+        response = model.generate_content(prompt)
+        mapping_text = response.text.strip()
+
+        # Extract JSON from response
+        if '```json' in mapping_text:
+            mapping_text = mapping_text.split('```json')[1].split('```')[0].strip()
+        elif '```' in mapping_text:
+            mapping_text = mapping_text.split('```')[1].split('```')[0].strip()
+
+        mapping = json.loads(mapping_text)
+        logger.info(f"Gemini matched {len(mapping)} enzyme variants")
+        for k, v in mapping.items():
+            logger.info(f"  Matched '{k}' -> '{v}'")
+        return mapping
+    except Exception as e:
+        logger.warning(f"Failed to match variants with Gemini: {e}")
+        return {}
 
 
 def run_lineage_format(reaction_csv: Path, substrate_scope_csv: Path, cleaned_csv: Path, output_csv: Path) -> Path:
     """
     Step 4: Format and merge all data into final CSV
-
+    Creates comprehensive format merging all available data, even if some extraction steps failed
     """
     logger.info(f"Formatting and merging data into final output")
 
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
+    try:
+        import pandas as pd
+
+        # Read all available data files
+        logger.info("Reading enzyme lineage data...")
+        df_lineage = pd.read_csv(cleaned_csv)
+
+        logger.info("Reading reaction data...")
+        try:
+            df_reaction = pd.read_csv(reaction_csv)
+            has_reaction_data = len(df_reaction) > 0 and not df_reaction.empty
+        except:
+            df_reaction = pd.DataFrame()
+            has_reaction_data = False
+
+        logger.info("Reading substrate scope data...")
+        try:
+            df_scope = pd.read_csv(substrate_scope_csv)
+            has_scope_data = len(df_scope) > 0 and not df_scope.empty
+        except:
+            df_scope = pd.DataFrame()
+            has_scope_data = False
+
+        # Start with lineage data as base
+        df_final = df_lineage.copy()
+
+        # Ensure consistent enzyme ID column
+        if 'variant_id' in df_final.columns and 'enzyme_id' not in df_final.columns:
+            df_final = df_final.rename(columns={'variant_id': 'enzyme_id'})
+
+        # Merge reaction data if available
+        if has_reaction_data:
+            logger.info(f"Merging reaction data ({len(df_reaction)} records)")
+            # Match on enzyme_id or enzyme
+            merge_key = 'enzyme_id' if 'enzyme_id' in df_reaction.columns else 'enzyme'
+            if merge_key in df_reaction.columns:
+                df_final = df_final.merge(df_reaction, left_on='enzyme_id', right_on=merge_key, how='left', suffixes=('', '_reaction'))
+        else:
+            logger.info("No reaction data available")
+
+        # Merge substrate scope data if available
+        if has_scope_data:
+            logger.info(f"Merging substrate scope data ({len(df_scope)} records)")
+            merge_key = 'enzyme_id' if 'enzyme_id' in df_scope.columns else 'enzyme'
+
+            if merge_key in df_scope.columns:
+                # First try direct merge
+                df_test_merge = df_final.merge(df_scope, left_on='enzyme_id', right_on=merge_key, how='left', suffixes=('', '_scope'))
+
+                # Check if any matches were found
+                matched_count = df_test_merge[merge_key + '_scope'].notna().sum() if merge_key + '_scope' in df_test_merge.columns else 0
+
+                if matched_count == 0:
+                    logger.info("No direct matches found, using Gemini to match enzyme variants...")
+
+                    # Get unique enzyme IDs from both datasets
+                    lineage_enzymes = df_final['enzyme_id'].dropna().unique().tolist()
+                    scope_enzymes = df_scope[merge_key].dropna().unique().tolist()
+
+                    # Get mapping from Gemini
+                    mapping = match_enzyme_variants_with_gemini(lineage_enzymes, scope_enzymes)
+
+                    if mapping:
+                        # Apply mapping to scope data
+                        df_scope_mapped = df_scope.copy()
+                        df_scope_mapped[merge_key] = df_scope_mapped[merge_key].map(lambda x: mapping.get(x, x))
+                        df_final = df_final.merge(df_scope_mapped, left_on='enzyme_id', right_on=merge_key, how='left', suffixes=('', '_scope'))
+                    else:
+                        logger.warning("Could not match enzyme variants between datasets")
+                        df_final = df_test_merge
+                else:
+                    df_final = df_test_merge
+                    logger.info(f"Direct merge matched {matched_count} records")
+        else:
+            logger.info("No substrate scope data available")
+
+        # Add comprehensive column structure for missing data
+        essential_columns = [
+            'enzyme_id', 'parent_id', 'generation', 'mutations', 'campaign_id', 'notes',
+            'aa_seq', 'dna_seq', 'seq_confidence', 'truncated', 'seq_source', 'doi',
+            'substrate_list', 'substrate_iupac_list', 'product_list', 'product_iupac_list',
+            'cofactor_list', 'cofactor_iupac_list', 'yield', 'ee', 'ttn',
+            'reaction_temperature', 'reaction_ph', 'reaction_buffer', 'reaction_other_conditions',
+            'data_location'
+        ]
+
+        # Add missing columns with NaN
+        for col in essential_columns:
+            if col not in df_final.columns:
+                df_final[col] = None
+
+        # Clean up duplicate columns from merging
+        columns_to_keep = []
+        seen_base_names = set()
+        for col in df_final.columns:
+            base_name = col.split('_reaction')[0].split('_scope')[0]
+            if base_name not in seen_base_names:
+                columns_to_keep.append(col)
+                seen_base_names.add(base_name)
+            elif col.endswith('_scope') or col.endswith('_reaction'):
+                # Prefer scope or reaction data over base lineage data for certain columns
+                if base_name in ['substrate_list', 'product_list', 'yield', 'ee', 'reaction_temperature']:
+                    columns_to_keep.append(col)
+                    # Remove the base column if it exists
+                    if base_name in columns_to_keep:
+                        columns_to_keep.remove(base_name)
+                    seen_base_names.add(base_name)
+
+        df_final = df_final[columns_to_keep]
+
+        # Rename merged columns back to standard names
+        rename_map = {}
+        for col in df_final.columns:
+            if col.endswith('_scope') or col.endswith('_reaction'):
+                base_name = col.split('_scope')[0].split('_reaction')[0]
+                rename_map[col] = base_name
+        df_final = df_final.rename(columns=rename_map)
+
+        # Save the comprehensive final output
+        df_final.to_csv(output_csv, index=False)
+
+        logger.info(f"Final comprehensive format complete: {output_csv}")
+        logger.info(f"Final output contains {len(df_final)} variants with {len(df_final.columns)} data columns")
+
+        # Log what data was successfully merged
+        if has_reaction_data:
+            logger.info("✓ Reaction performance data merged")
+        if has_scope_data:
+            logger.info("✓ Substrate scope data merged")
+
+        # Now run the actual lineage format to produce plate-based format
+        logger.info("\nRunning lineage format to produce plate-based output...")
+        try:
+            from .lineage_format import flatten_dataframe
+
+            # Create the plate-based output filename
+            plate_output = output_csv.parent / (output_csv.stem + "_plate_format.csv")
+
+            # Flatten the dataframe to plate format
+            df_flattened = flatten_dataframe(df_final)
+
+            # Save the flattened output
+            df_flattened.to_csv(plate_output, index=False)
+
+            logger.info(f"✓ Plate-based format saved to: {plate_output}")
+            logger.info(f"  Contains {len(df_flattened)} rows with plate/well assignments")
+
+            # Update the final output path to be the plate format
+            output_csv = plate_output
+
+        except Exception as e:
+            logger.warning(f"Could not generate plate-based format: {e}")
+            logger.info("Comprehensive format will be used as final output")
+
+        return output_csv
+
+    except Exception as e:
+        logger.warning(f"Final formatting failed: {e}")
+        logger.info("Using cleaned sequence data as final output...")
+
+        # Copy the cleaned CSV as the final output
+        import shutil
+        shutil.copy2(cleaned_csv, output_csv)
+
+        logger.info(f"Cleaned sequence file used as final output: {output_csv}")
+        return output_csv
 
 
 def run_pipeline(
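The suffix bookkeeping in `run_lineage_format` is easiest to see on a toy example: pandas appends `_scope`/`_reaction` only to colliding column names, so the cleanup pass has to handle both suffixed and unsuffixed forms. A small illustration (assuming pandas):

```python
import pandas as pd

df_final = pd.DataFrame({"enzyme_id": ["WT"], "yield": [None]})
df_scope = pd.DataFrame({"enzyme": ["WT"], "yield": [92.0]})

merged = df_final.merge(df_scope, left_on="enzyme_id", right_on="enzyme",
                        how="left", suffixes=("", "_scope"))
print(list(merged.columns))
# ['enzyme_id', 'yield', 'enzyme', 'yield_scope']
# The colliding 'yield' came back as 'yield_scope'; the cleanup pass keeps it,
# drops the empty base 'yield', then renames 'yield_scope' back to 'yield'.
```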
@@ -206,7 +438,7 @@ def run_pipeline(
 
     # Step 4: Format and merge
     logger.info("\n[Step 4/5] Formatting and merging data...")
-    run_lineage_format(reaction_csv, substrate_csv, cleaned_csv, output_path)
+    final_output = run_lineage_format(reaction_csv, substrate_csv, cleaned_csv, output_path)
 
     # Step 5: Finalize
     logger.info("\n[Step 5/5] Finalizing...")
@@ -219,11 +451,13 @@ def run_pipeline(
 
         logger.info("\n" + "="*60)
         logger.info("PIPELINE COMPLETED SUCCESSFULLY")
-        logger.info(f"
+        logger.info(f"Comprehensive output: {output_path}")
+        if final_output != output_path:
+            logger.info(f"Plate-based output: {final_output}")
         logger.info(f"Runtime: {elapsed:.1f} seconds")
         logger.info("="*60)
 
-        return
+        return final_output
 
     except Exception as e:
         logger.error(f"Pipeline failed: {str(e)}")
debase-0.1.18.dist-info/RECORD
ADDED
@@ -0,0 +1,17 @@
+debase/PIPELINE_FLOW.md,sha256=S4nQyZlX39-Bchw1gQWPK60sHiFpB1eWHqo5GR9oTY8,4741
+debase/__init__.py,sha256=YeKveGj_8fwuu5ozoK2mUU86so_FjiCwsvg1d_lYVZU,586
+debase/__main__.py,sha256=LbxYt2x9TG5Ced7LpzzX_8gkWyXeZSlVHzqHfqAiPwQ,160
+debase/_version.py,sha256=Qd1kKsssesKE5FvJnDdAuZsx_BrxTSJJyt68SK99D54,50
+debase/build_db.py,sha256=bW574GxsL1BJtDwM19urLbciPcejLzfraXZPpzm09FQ,7167
+debase/cleanup_sequence.py,sha256=QyhUqvTBVFTGM7ebAHmP3tif3Jq-8hvoLApYwAJtpH4,32702
+debase/enzyme_lineage_extractor.py,sha256=xbNKkIMRCM2dYHsX24vWX1EsQINaGSWBj-iTX10B8Mw,117057
+debase/lineage_format.py,sha256=IS9ig-Uv7KxtI9enZKM6YgQ7sitqwOo4cdXbOy38J3s,34232
+debase/reaction_info_extractor.py,sha256=W9CS0puFTdhJ_T2Fpy931EgnjOCsHHjbtU6RdnzDlhw,113140
+debase/substrate_scope_extractor.py,sha256=9XDF-DxOqB63AwaVceAMvg7BcjoTQXE_pG2c_seM_DA,100698
+debase/wrapper.py,sha256=V9bs8ZiyCpJHMM5VuN74kiKdkQRVU6vyvLKCrO1BUB8,20890
+debase-0.1.18.dist-info/licenses/LICENSE,sha256=5sk9_tcNmr1r2iMIUAiioBo7wo38u8BrPlO7f0seqgE,1075
+debase-0.1.18.dist-info/METADATA,sha256=XvSrveJ0Y40c53JYUfiveaQNJ3qoEkxaQ61n3_--1cQ,10790
+debase-0.1.18.dist-info/WHEEL,sha256=_zCd3N1l69ArxyTb8rzEoP9TpbYXkqRFSNOD5OuxnTs,91
+debase-0.1.18.dist-info/entry_points.txt,sha256=hUcxA1b4xORu-HHBFTe9u2KTdbxPzt0dwz95_6JNe9M,48
+debase-0.1.18.dist-info/top_level.txt,sha256=2BUeq-4kmQr0Rhl06AnRzmmZNs8WzBRK9OcJehkcdk8,7
+debase-0.1.18.dist-info/RECORD,,
debase-0.1.16.dist-info/RECORD
DELETED
@@ -1,16 +0,0 @@
-debase/__init__.py,sha256=YeKveGj_8fwuu5ozoK2mUU86so_FjiCwsvg1d_lYVZU,586
-debase/__main__.py,sha256=LbxYt2x9TG5Ced7LpzzX_8gkWyXeZSlVHzqHfqAiPwQ,160
-debase/_version.py,sha256=l25FRqoNjxB5d3qBHsLMMA_9YWsIZ7nJ5BiTLj0qYE8,50
-debase/build_db.py,sha256=bW574GxsL1BJtDwM19urLbciPcejLzfraXZPpzm09FQ,7167
-debase/cleanup_sequence.py,sha256=QyhUqvTBVFTGM7ebAHmP3tif3Jq-8hvoLApYwAJtpH4,32702
-debase/enzyme_lineage_extractor.py,sha256=jNxNCh8VF0dUFxUlTall0w1-oQojXRXLnWcuPFs5ij8,106879
-debase/lineage_format.py,sha256=mACni9M1RXA_1tIyDZJpStQoutd_HLG2qQMAORTusZs,30045
-debase/reaction_info_extractor.py,sha256=9DkEZh7TgsxKpFkKbLyUhS_w0Z84LczkDFv-v_NEHE4,112174
-debase/substrate_scope_extractor.py,sha256=9XDF-DxOqB63AwaVceAMvg7BcjoTQXE_pG2c_seM_DA,100698
-debase/wrapper.py,sha256=lTx375a57EVuXcZ_roXaj5UDj8HjRcb5ViNaSgPN4Ik,10352
-debase-0.1.16.dist-info/licenses/LICENSE,sha256=5sk9_tcNmr1r2iMIUAiioBo7wo38u8BrPlO7f0seqgE,1075
-debase-0.1.16.dist-info/METADATA,sha256=7sv2OcIuHaoOImkBdoEtRzyOjp9Kuoz2ZmgK4tosaUc,10790
-debase-0.1.16.dist-info/WHEEL,sha256=_zCd3N1l69ArxyTb8rzEoP9TpbYXkqRFSNOD5OuxnTs,91
-debase-0.1.16.dist-info/entry_points.txt,sha256=hUcxA1b4xORu-HHBFTe9u2KTdbxPzt0dwz95_6JNe9M,48
-debase-0.1.16.dist-info/top_level.txt,sha256=2BUeq-4kmQr0Rhl06AnRzmmZNs8WzBRK9OcJehkcdk8,7
-debase-0.1.16.dist-info/RECORD,,
{debase-0.1.16.dist-info → debase-0.1.18.dist-info}/WHEEL
File without changes
{debase-0.1.16.dist-info → debase-0.1.18.dist-info}/entry_points.txt
File without changes
{debase-0.1.16.dist-info → debase-0.1.18.dist-info}/licenses/LICENSE
File without changes
{debase-0.1.16.dist-info → debase-0.1.18.dist-info}/top_level.txt
File without changes