PyPI - statement-parser - Versions diffs - 0.0.14__tar.gz → 0.1.0__tar.gz - Mend

statement-parser 0.0.14__tar.gz → 0.1.0__tar.gz


Files changed (25)

{statement_parser-0.0.14/statement_parser.egg-info → statement_parser-0.1.0}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: statement_parser
-Version: 0.0.14
+Version: 0.1.0
 Summary: Bank Statement Parser is a Python library designed to parse and normalize transaction data from various bank statement formats ( CSV, Excel, etc.) into a consistent and easy-to-use Pandas DataFrame. It supports multiple banks and file formats, making it a versatile tool for financial data analysis.
 Author-email: Khuzema Challawala <khuzema.ac@gmail.com>
 Classifier: Programming Language :: Python :: 3
@@ -34,10 +34,11 @@ Dynamic: license-file
 ## Features
 - **Multi-Format Support**: Parse bank statements from  CSV, Excel, and more.
-- **Bank-Specific Parsing**: Customizable parsers for different banks.
-- **Consistent Output**: Normalized transaction data with standardized columns (`Date`, `Description`, `Amount`,  etc.).
+- **Config-Driven**: All parsing behaviour is described by a config dict — no per-bank classes to maintain.
+- **Resilient**: Tolerant of header spacing/punctuation changes and messy number/date formatting.
+- **Consistent Output**: Normalized transaction data with standardized columns (`bank`, `created_date`, `remarks`, `amount`, `hash`).
 - **Easy Integration**: Simple API for quick integration into your Python projects.
-- **Extensible**: Add support for new banks or formats with minimal effort.
+- **Extensible**: Add support for a new bank by writing config, not code.
 ---
@@ -51,13 +52,50 @@ pip install statement_parser
 # Usage
-### Basic Example
+### Using a bundled preset
+```python
+from statement_parser import GenericBank, list_banks
+print(list_banks())  # ['HDFC-CREDIT', 'HSBC-CREDIT', ...]
+parser = GenericBank.from_builtin("HSBC-CREDIT")
+df = parser.getDataFrame("path/to/statement.csv")
+print(df.head())
+```
+### Bring your own config
+No subclassing required — define a config dict and hand it to `GenericBank`:
 ```python
-from statement_parser.banks.HdfcCredit import HdfcCredit
+from statement_parser import GenericBank
+config = {
+    "file": {
+        "delimiter": ",",
+        "header": {
+            "mode": "detect",                       # detect | fixed | none
+            "match": ["date", "description", "amount"],
+            "min_matches": 2,
+        },
+    },
+    "columns": {                                    # logical -> candidate names
+        "date": ["Txn Date", "Date"],
+        "details": ["Description", "Narration"],
+        "amount": ["Amount"],
+    },
+    "date_field": "date",
+    "remarks": [{"field": "details"}],
+    "amount": {"mode": "direct", "field": "amount"},
+}
-parser = HsbcCredit()
+parser = GenericBank(config, bank_id="MY-BANK")
 df = parser.getDataFrame("path/to/statement.csv")
-# Display the parsed transactions
 print(df.head())
 ```
+**Amount modes**: `direct` (single amount column), `signed` (amount column whose
+sign is decided by a CR/DR column), or `deposit_minus_withdrawal` (separate
+deposit and withdrawal columns).

statement_parser-0.1.0/README.md ADDED

@@ -0,0 +1,78 @@
+# Bank Statement Parser
+![Python Version](https://img.shields.io/badge/python-3.7%2B-blue)
+![License](https://img.shields.io/badge/license-MIT-green)
+![PyPI Version](https://img.shields.io/pypi/v/bank-statement-parser)
+**Bank Statement Parser** is a Python library designed to parse and normalize transaction data from various bank statement formats ( CSV, Excel, etc.) into a consistent and easy-to-use Pandas DataFrame. It supports multiple banks and file formats, making it a versatile tool for financial data analysis.
+---
+## Features
+- **Multi-Format Support**: Parse bank statements from  CSV, Excel, and more.
+- **Config-Driven**: All parsing behaviour is described by a config dict — no per-bank classes to maintain.
+- **Resilient**: Tolerant of header spacing/punctuation changes and messy number/date formatting.
+- **Consistent Output**: Normalized transaction data with standardized columns (`bank`, `created_date`, `remarks`, `amount`, `hash`).
+- **Easy Integration**: Simple API for quick integration into your Python projects.
+- **Extensible**: Add support for a new bank by writing config, not code.
+---
+## Installation
+You can install the library via pip:
+```bash
+pip install statement_parser
+```
+# Usage
+### Using a bundled preset
+```python
+from statement_parser import GenericBank, list_banks
+print(list_banks())  # ['HDFC-CREDIT', 'HSBC-CREDIT', ...]
+parser = GenericBank.from_builtin("HSBC-CREDIT")
+df = parser.getDataFrame("path/to/statement.csv")
+print(df.head())
+```
+### Bring your own config
+No subclassing required — define a config dict and hand it to `GenericBank`:
+```python
+from statement_parser import GenericBank
+config = {
+    "file": {
+        "delimiter": ",",
+        "header": {
+            "mode": "detect",                       # detect | fixed | none
+            "match": ["date", "description", "amount"],
+            "min_matches": 2,
+        },
+    },
+    "columns": {                                    # logical -> candidate names
+        "date": ["Txn Date", "Date"],
+        "details": ["Description", "Narration"],
+        "amount": ["Amount"],
+    },
+    "date_field": "date",
+    "remarks": [{"field": "details"}],
+    "amount": {"mode": "direct", "field": "amount"},
+}
+parser = GenericBank(config, bank_id="MY-BANK")
+df = parser.getDataFrame("path/to/statement.csv")
+print(df.head())
+```
+**Amount modes**: `direct` (single amount column), `signed` (amount column whose
+sign is decided by a CR/DR column), or `deposit_minus_withdrawal` (separate
+deposit and withdrawal columns).
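
Review note on the amount modes described above: the following is a minimal, pure-Python sketch of how the three modes could turn one row into a signed amount. It is not the package's code. The helper names `to_float` and `compute_amount` are hypothetical, and the sketch looks rows up directly by header name, whereas the package resolves logical field names through the `columns` mapping first.

```python
import re


def to_float(raw) -> float:
    """Keep only digits, '.', and '-' before casting; blanks become 0."""
    cleaned = re.sub(r"[^0-9.\-]", "", str(raw))
    return float(cleaned) if cleaned else 0.0


def compute_amount(row: dict, cfg: dict) -> float:
    """Derive one signed amount from a row according to cfg['mode']."""
    mode = cfg.get("mode", "direct")
    if mode == "direct":
        # A single column already carries the signed value.
        return to_float(row[cfg["field"]])
    if mode == "deposit_minus_withdrawal":
        # Two unsigned columns; money out is subtracted from money in.
        return to_float(row[cfg["deposit"]]) - to_float(row[cfg["withdrawal"]])
    if mode == "signed":
        # An unsigned amount plus a marker column such as CR/DR.
        credit_values = {str(v).upper() for v in cfg.get("credit_values", ["CR"])}
        marker = str(row[cfg["sign_field"]]).strip().upper()
        is_credit = marker in credit_values
        sign = cfg.get("credit_sign", 1) if is_credit else cfg.get("debit_sign", -1)
        return to_float(row[cfg["field"]]) * sign
    raise ValueError(f"Unknown amount mode '{mode}'")


direct = compute_amount({"Amount": "1,250.50"}, {"mode": "direct", "field": "Amount"})
split = compute_amount(
    {"Deposits": "", "Withdrawals": "Rs 300.00"},
    {"mode": "deposit_minus_withdrawal", "deposit": "Deposits", "withdrawal": "Withdrawals"},
)
signed = compute_amount(
    {"Amount": "99.00", "Dr / Cr": "dr "},
    {"mode": "signed", "field": "Amount", "sign_field": "Dr / Cr"},
)
print(direct, split, signed)
```

In `signed` mode any marker that is not listed in `credit_values` is treated as a debit, so an unexpected marker value silently yields a negative amount.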

{statement_parser-0.0.14 → statement_parser-0.1.0}/pyproject.toml RENAMED

@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"
 [project]
 name = "statement_parser"
-version = "0.0.14"
+version = "0.1.0"
 authors = [
     { name="Khuzema Challawala", email="khuzema.ac@gmail.com" },
 ]
@@ -30,4 +30,7 @@ dev = [
     "pytest-cov>=6.0.0",
     "flake8>=7.1.2",
     "sphinx"
-]
+]
+[tool.setuptools.package-data]
+statement_parser = ["bank_configs.json"]

statement_parser-0.1.0/statement_parser/GenericBank.py ADDED

@@ -0,0 +1,312 @@
+from __future__ import annotations
+import json
+import re
+from pathlib import Path
+import pandas as pd
+from dateutil import parser as date_parser
+from statement_parser.Bank import Bank
+from statement_parser.Transaction import Transaction
+CONFIG_PATH = Path(__file__).parent / "bank_configs.json"
+_CONFIG_CACHE: dict | None = None
+def load_configs() -> dict:
+    """Load and cache the bank configuration file."""
+    global _CONFIG_CACHE
+    if _CONFIG_CACHE is None:
+        with open(CONFIG_PATH, "r", encoding="utf-8") as handle:
+            _CONFIG_CACHE = json.load(handle)
+    return _CONFIG_CACHE
+def list_banks() -> list[str]:
+    """Return the list of configured bank ids."""
+    return list(load_configs().keys())
+def _normalize(name) -> str:
+    """Normalize a column name so spacing/punctuation changes don't matter."""
+    return re.sub(r"[^a-z0-9]", "", str(name).lower())
+class GenericBank(Bank):
+    """
+    A single, configuration-driven statement parser.
+    Behaviour (delimiters, header location, column names, how the
+    amount/remarks are derived) is described entirely by a ``config`` dict, so
+    supporting a new bank means writing config - not code.
+    Usage::
+        # Bring your own config
+        parser = GenericBank(my_config, bank_id="MY-BANK")
+        df = parser.getDataFrame("statement.csv")
+        # Or use one of the bundled presets
+        parser = GenericBank.from_builtin("HDFC-CREDIT")
+    The engine is intentionally tolerant:
+      * Column names are matched after normalization, so an added/removed
+        space or punctuation change in a header does not break parsing.
+      * Numbers are cleaned (commas, currency symbols, blanks) before casting.
+      * Dates are parsed leniently and rows without a valid date are dropped,
+        which also removes header/footer junk automatically.
+    """
+    def __init__(self, config: dict, bank_id: str | None = None):
+        if not isinstance(config, dict):
+            raise TypeError(
+                "config must be a dict describing how to parse the statement. "
+                "Use GenericBank.from_builtin(<id>) for a bundled preset."
+            )
+        self.config = config
+        self.bank_id = bank_id or config.get("bank_id", "UNKNOWN")
+    @classmethod
+    def from_builtin(cls, bank_id: str) -> "GenericBank":
+        """Create a parser from one of the configs in ``bank_configs.json``."""
+        configs = load_configs()
+        if bank_id not in configs:
+            raise ValueError(
+                f"No built-in configuration for '{bank_id}'. "
+                f"Available: {', '.join(configs)}"
+            )
+        return cls(config=configs[bank_id], bank_id=bank_id)
+    # ------------------------------------------------------------------ #
+    # Public API
+    # ------------------------------------------------------------------ #
+    def getTransactions(self, filename: str) -> list[Transaction]:
+        df = self.getData(filename)
+        return self._build_transactions(df)
+    def getData(self, filename: str) -> pd.DataFrame:
+        df = self._load(filename)
+        df.columns = [str(c).strip() for c in df.columns]
+        return df
+    # ------------------------------------------------------------------ #
+    # Loading
+    # ------------------------------------------------------------------ #
+    def _load(self, filename: str) -> pd.DataFrame:
+        file_cfg = self.config.get("file", {})
+        header_cfg = file_cfg.get("header", {"mode": "fixed", "row": 0})
+        mode = header_cfg.get("mode", "fixed")
+        delimiter = file_cfg.get("delimiter", ",")
+        is_excel = filename.endswith((".xls", ".xlsx"))
+        if mode == "none":
+            names = header_cfg.get("names")
+            if is_excel:
+                df = pd.read_excel(filename, header=None)
+            else:
+                df = pd.read_csv(
+                    filename,
+                    delimiter=delimiter,
+                    header=None,
+                    engine="python",
+                    on_bad_lines="skip",
+                )
+            if names:
+                df = df.iloc[:, : len(names)]
+                df.columns = names
+            return df
+        if mode == "detect":
+            skip_rows = self._find_header_row(
+                filename,
+                [t.lower() for t in header_cfg.get("match", [])],
+                header_cfg.get("min_matches", 1),
+                delimiter,
+                is_excel,
+            )
+        else:
+            skip_rows = header_cfg.get("row", 0)
+        if is_excel:
+            return pd.read_excel(filename, skiprows=skip_rows)
+        return pd.read_csv(
+            filename,
+            delimiter=delimiter,
+            skiprows=skip_rows,
+            engine="python",
+            on_bad_lines="skip",
+        )
+    def _find_header_row(self, filename, tokens, min_matches,
+                         delimiter, is_excel) -> int:
+        if is_excel:
+            raw = pd.read_excel(filename, header=None)
+            lines = [
+                " ".join(str(v) for v in row if pd.notna(v)).lower()
+                for row in raw.values.tolist()
+            ]
+        else:
+            with open(filename, "r", encoding="utf-8") as handle:
+                lines = [line.lower() for line in handle.readlines()]
+        for i, line in enumerate(lines):
+            matches = sum(1 for token in tokens if token in line)
+            if matches >= min_matches:
+                return i
+        raise ValueError(
+            f"Could not locate header row for '{self.bank_id}'. "
+            f"Expected at least {min_matches} of: {tokens}"
+        )
+    # ------------------------------------------------------------------ #
+    # Column resolution
+    # ------------------------------------------------------------------ #
+    def _resolve_columns(self, df: pd.DataFrame) -> dict:
+        """Map each logical field to an actual dataframe column."""
+        normalized = {}
+        for actual in df.columns:
+            normalized.setdefault(_normalize(actual), actual)
+        resolved = {}
+        for logical, candidates in self.config.get("columns", {}).items():
+            if isinstance(candidates, str):
+                candidates = [candidates]
+            found = None
+            for candidate in candidates:
+                key = _normalize(candidate)
+                if key in normalized:
+                    found = normalized[key]
+                    break
+            if found is None:
+                raise ValueError(
+                    f"[{self.bank_id}] Could not find a column for "
+                    f"'{logical}'. Tried {candidates}. "
+                    f"Available columns: {list(df.columns)}"
+                )
+            resolved[logical] = found
+        return resolved
+    # ------------------------------------------------------------------ #
+    # Value helpers
+    # ------------------------------------------------------------------ #
+    @staticmethod
+    def _to_float(series: pd.Series) -> pd.Series:
+        cleaned = (
+            series.astype(str)
+            .str.replace(",", "", regex=False)
+            .str.replace(r"[^0-9.\-]", "", regex=True)
+            .str.strip()
+            .replace("", "0")
+        )
+        return pd.to_numeric(cleaned, errors="coerce")
+    @staticmethod
+    def _to_date(series: pd.Series) -> pd.Series:
+        def parse(value):
+            text = str(value).strip()
+            if not text:
+                return pd.NaT
+            # A bare number (id/serial column) is not a real date. dateutil
+            # would happily turn "1" into a date, so reject these explicitly.
+            if re.fullmatch(r"[+-]?\d+(\.\d+)?", text):
+                return pd.NaT
+            try:
+                parsed = date_parser.parse(text, dayfirst=True)
+                if parsed.year < 1900:
+                    return pd.NaT
+                return parsed
+            except (ValueError, OverflowError, TypeError):
+                return pd.NaT
+        return series.apply(parse)
+    # ------------------------------------------------------------------ #
+    # Transaction building
+    # ------------------------------------------------------------------ #
+    def _build_transactions(self, df: pd.DataFrame) -> list[Transaction]:
+        cols = self._resolve_columns(df)
+        work = pd.DataFrame(index=df.index)
+        # Date (used to drop non-transaction rows automatically)
+        date_field = self.config["date_field"]
+        work["_date"] = self._to_date(df[cols[date_field]])
+        work = work[work["_date"].notna()]
+        df = df.loc[work.index]
+        # Amount
+        work["_amount"] = self._compute_amount(df, cols).loc[work.index]
+        # Remarks (without the duplicate marker yet)
+        work["_remarks"] = self._compute_remarks(df, cols).loc[work.index]
+        # Duplicate sequence marker
+        seq = (
+            work.groupby(["_date", "_remarks", "_amount"]).cumcount().add(1)
+        )
+        transactions: list[Transaction] = []
+        for idx, row in work.iterrows():
+            remarks = row["_remarks"]
+            if seq[idx] > 1:
+                remarks = remarks + " (" + str(seq[idx]) + ") "
+            transactions.append(
+                Transaction(
+                    bank=self.bank_id,
+                    created_date=row["_date"],
+                    remarks=remarks,
+                    amount=row["_amount"],
+                )
+            )
+        return transactions
+    def _compute_amount(self, df: pd.DataFrame, cols: dict) -> pd.Series:
+        cfg = self.config["amount"]
+        mode = cfg.get("mode", "direct")
+        if mode == "direct":
+            return self._to_float(df[cols[cfg["field"]]]).fillna(0)
+        if mode == "deposit_minus_withdrawal":
+            deposit = self._to_float(df[cols[cfg["deposit"]]]).fillna(0)
+            withdrawal = self._to_float(df[cols[cfg["withdrawal"]]]).fillna(0)
+            return deposit - withdrawal
+        if mode == "signed":
+            value = self._to_float(df[cols[cfg["field"]]]).fillna(0)
+            credit_values = {
+                str(v).upper() for v in cfg.get("credit_values", ["CR"])
+            }
+            credit_sign = cfg.get("credit_sign", 1)
+            debit_sign = cfg.get("debit_sign", -1)
+            sign_col = df[cols[cfg["sign_field"]]].astype(str)
+            sign = sign_col.str.upper().str.strip().map(
+                lambda v: credit_sign if v in credit_values else debit_sign
+            )
+            return value * sign
+        raise ValueError(f"[{self.bank_id}] Unknown amount mode '{mode}'")
+    def _compute_remarks(self, df: pd.DataFrame, cols: dict) -> pd.Series:
+        parts_cfg = self.config["remarks"]
+        result = pd.Series([""] * len(df), index=df.index)
+        for part in parts_cfg:
+            field = part["field"]
+            prefix = part.get("prefix", "")
+            suffix = part.get("suffix", "")
+            skip = {str(s).lower() for s in part.get("skip", [])}
+            column = df[cols[field]]
+            def render(value):
+                text = "" if pd.isna(value) else str(value).strip()
+                if text.lower() in skip:
+                    return ""
+                return prefix + text + suffix
+            result = result + column.map(render)
+        return result.str.strip()
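
Review note on the header detection and column resolution added in this file: the sketch below mirrors the two ideas in plain Python so the behaviour can be tried without pandas or a statement file. The function names and the sample lines are assumptions made for illustration. Unlike `getData` in the diff, the sketch does not strip whitespace from headers.

```python
import re


def normalize(name) -> str:
    """Lowercase and drop everything except a-z and 0-9."""
    return re.sub(r"[^a-z0-9]", "", str(name).lower())


def resolve_columns(actual_columns: list, wanted: dict) -> dict:
    """Map each logical field to the first candidate that matches a real header."""
    by_key = {}
    for actual in actual_columns:
        # If two headers normalize to the same key, the first one wins.
        by_key.setdefault(normalize(actual), actual)
    resolved = {}
    for logical, candidates in wanted.items():
        keys = [normalize(candidate) for candidate in candidates]
        match = next((by_key[key] for key in keys if key in by_key), None)
        if match is None:
            raise ValueError(f"No column for '{logical}'; tried {candidates}")
        resolved[logical] = match
    return resolved


def find_header_row(lines: list, tokens: list, min_matches: int) -> int:
    """Return the index of the first line containing at least min_matches tokens."""
    for index, line in enumerate(lines):
        lowered = line.lower()
        if sum(1 for token in tokens if token in lowered) >= min_matches:
            return index
    raise ValueError(f"Header row not found; expected {min_matches} of {tokens}")


lines = [
    "Account statement for 01/04/2024 to 30/04/2024",
    "Txn  Date,Narration ,AMOUNT",
    "02/04/2024,Coffee,-3.50",
]
header_index = find_header_row(lines, ["date", "narration", "amount"], 2)
headers = lines[header_index].split(",")
columns = resolve_columns(
    headers,
    {
        "date": ["Txn Date", "Date"],
        "details": ["Description", "Narration"],
        "amount": ["Amount"],
    },
)
print(header_index, columns)
```

Header detection is a substring test on the lowercased line, so a preamble line that happens to contain enough of the tokens would be selected as the header. Choosing distinctive tokens and a higher `min_matches` reduces that risk.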

statement_parser-0.1.0/statement_parser/__init__.py ADDED

@@ -0,0 +1,9 @@
+from .Bank import Bank
+from .Transaction import Transaction
+from .GenericBank import GenericBank, list_banks, load_configs
+__all__ = ['Bank',
+           'Transaction',
+           'GenericBank',
+           'list_banks',
+           'load_configs']

statement_parser-0.1.0/statement_parser/bank_configs.json ADDED

@@ -0,0 +1,198 @@
+{
+  "HDFC-CREDIT": {
+    "file": {
+      "delimiter": "~",
+      "header": {
+        "mode": "detect",
+        "match": ["transaction type", "description", "debit / credit"],
+        "min_matches": 2
+      }
+    },
+    "columns": {
+      "date": ["DATE"],
+      "description": ["Description"],
+      "amount": ["AMT"],
+      "sign": ["Debit / Credit"]
+    },
+    "date_field": "date",
+    "remarks": [
+      {"field": "description"}
+    ],
+    "amount": {
+      "mode": "signed",
+      "field": "amount",
+      "sign_field": "sign",
+      "credit_values": ["CR"]
+    }
+  },
+  "HSBC-CREDIT": {
+    "file": {
+      "delimiter": ",",
+      "header": {
+        "mode": "none",
+        "names": ["Date", "Transaction Details", "Amount"]
+      }
+    },
+    "columns": {
+      "date": ["Date"],
+      "details": ["Transaction Details"],
+      "amount": ["Amount"]
+    },
+    "date_field": "date",
+    "remarks": [
+      {"field": "details"}
+    ],
+    "amount": {
+      "mode": "direct",
+      "field": "amount"
+    }
+  },
+  "HSBC-DEBIT": {
+    "file": {
+      "header": {
+        "mode": "detect",
+        "match": ["date", "transaction details", "deposits", "withdrawals"],
+        "min_matches": 3
+      }
+    },
+    "columns": {
+      "date": ["Date"],
+      "details": ["Transaction Details"],
+      "deposit": ["Deposits"],
+      "withdrawal": ["Withdrawals"]
+    },
+    "date_field": "date",
+    "remarks": [
+      {"field": "details"}
+    ],
+    "amount": {
+      "mode": "deposit_minus_withdrawal",
+      "deposit": "deposit",
+      "withdrawal": "withdrawal"
+    }
+  },
+  "ICICI-CREDIT": {
+    "file": {
+      "delimiter": ",",
+      "header": {
+        "mode": "detect",
+        "match": ["date", "sr.no", "transaction details", "amount(in rs)"],
+        "min_matches": 3
+      }
+    },
+    "columns": {
+      "date": ["Date"],
+      "details": ["Transaction Details"],
+      "amount": ["Amount(in Rs)"],
+      "sign": ["BillingAmountSign"]
+    },
+    "date_field": "date",
+    "remarks": [
+      {"field": "details"}
+    ],
+    "amount": {
+      "mode": "signed",
+      "field": "amount",
+      "sign_field": "sign",
+      "credit_values": ["CR"]
+    }
+  },
+  "ICICI-DEBIT": {
+    "file": {
+      "header": {
+        "mode": "detect",
+        "match": [
+          "transaction date",
+          "cheque number",
+          "transaction remarks",
+          "withdrawal amount",
+          "deposit amount"
+        ],
+        "min_matches": 3
+      }
+    },
+    "columns": {
+      "date": ["Transaction Date"],
+      "cheque": ["Cheque Number"],
+      "remarks": ["Transaction Remarks"],
+      "withdrawal": ["Withdrawal Amount (INR )"],
+      "deposit": ["Deposit Amount (INR )"]
+    },
+    "date_field": "date",
+    "remarks": [
+      {"field": "cheque", "prefix": "CHQ: ", "suffix": " ",
+       "skip": ["-", "", "nan"]},
+      {"field": "remarks"}
+    ],
+    "amount": {
+      "mode": "deposit_minus_withdrawal",
+      "deposit": "deposit",
+      "withdrawal": "withdrawal"
+    }
+  },
+  "KOTAK-DEBIT": {
+    "file": {
+      "delimiter": ",",
+      "header": {
+        "mode": "detect",
+        "match": [
+          "sl. no.",
+          "transaction date",
+          "description",
+          "amount",
+          "dr / cr"
+        ],
+        "min_matches": 3
+      }
+    },
+    "columns": {
+      "date": ["Transaction Date"],
+      "description": ["Description"],
+      "ref": ["Chq / Ref No."],
+      "amount": ["Amount"],
+      "sign": ["Dr / Cr"]
+    },
+    "date_field": "date",
+    "remarks": [
+      {"field": "ref", "prefix": "Ref: ", "suffix": " ",
+       "skip": ["nan", "", "-"]},
+      {"field": "description"}
+    ],
+    "amount": {
+      "mode": "signed",
+      "field": "amount",
+      "sign_field": "sign",
+      "credit_values": ["CR"]
+    }
+  },
+  "WALLET": {
+    "file": {
+      "header": {
+        "mode": "detect",
+        "match": ["date", "note", "category", "amount"],
+        "min_matches": 3
+      }
+    },
+    "columns": {
+      "date": ["date"],
+      "note": ["note"],
+      "category": ["category"],
+      "amount": ["amount"]
+    },
+    "date_field": "date",
+    "remarks": [
+      {"field": "category", "suffix": ": ", "skip": [""]},
+      {"field": "note"}
+    ],
+    "amount": {
+      "mode": "direct",
+      "field": "amount"
+    }
+  }
+}
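
Review note on the preset file above: every preset refers to logical field names (`date_field`, each `remarks[].field`, and the keys inside `amount`) that must also appear under `columns`. A small consistency check along these lines could catch a typo before a statement is parsed. `check_config` is a hypothetical helper, not part of the package, and the required keys per mode are inferred from `_compute_amount` in this diff.

```python
def check_config(bank_id: str, config: dict) -> list:
    """Return a list of problems; an empty list means all references line up."""
    problems = []
    logical = set(config.get("columns", {}))

    def require(name, where):
        if name not in logical:
            problems.append(f"[{bank_id}] {where} refers to undefined column '{name}'")

    require(config.get("date_field"), "date_field")
    for part in config.get("remarks", []):
        require(part.get("field"), "remarks")

    amount = config.get("amount", {})
    mode = amount.get("mode", "direct")
    keys_by_mode = {
        "direct": ["field"],
        "signed": ["field", "sign_field"],
        "deposit_minus_withdrawal": ["deposit", "withdrawal"],
    }
    if mode not in keys_by_mode:
        problems.append(f"[{bank_id}] unknown amount mode '{mode}'")
    else:
        for key in keys_by_mode[mode]:
            require(amount.get(key), f"amount.{key}")
    return problems


# The WALLET preset, copied from the JSON above.
wallet = {
    "file": {"header": {"mode": "detect",
                        "match": ["date", "note", "category", "amount"],
                        "min_matches": 3}},
    "columns": {"date": ["date"], "note": ["note"],
                "category": ["category"], "amount": ["amount"]},
    "date_field": "date",
    "remarks": [{"field": "category", "suffix": ": ", "skip": [""]},
                {"field": "note"}],
    "amount": {"mode": "direct", "field": "amount"},
}

# A deliberately broken config: 'details' and 'sign' are never declared.
broken = {
    "columns": {"date": ["Date"], "amount": ["Amount"]},
    "date_field": "date",
    "remarks": [{"field": "details"}],
    "amount": {"mode": "signed", "field": "amount", "sign_field": "sign"},
}

wallet_problems = check_config("WALLET", wallet)
broken_problems = check_config("BROKEN", broken)
print(wallet_problems)
print(broken_problems)
```

The check only validates cross-references inside the config. It does not confirm that the candidate header names exist in a real statement, which the parser can only discover at load time.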
