parqv 0.2.1__tar.gz → 0.3.0__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- {parqv-0.2.1/src/parqv.egg-info → parqv-0.3.0}/PKG-INFO +4 -5
- {parqv-0.2.1 → parqv-0.3.0}/README.md +3 -4
- {parqv-0.2.1 → parqv-0.3.0}/pyproject.toml +1 -1
- {parqv-0.2.1 → parqv-0.3.0}/src/parqv/core/config.py +2 -1
- {parqv-0.2.1 → parqv-0.3.0}/src/parqv/core/handler_factory.py +2 -1
- {parqv-0.2.1 → parqv-0.3.0}/src/parqv/data_sources/__init__.py +4 -0
- {parqv-0.2.1 → parqv-0.3.0}/src/parqv/data_sources/formats/__init__.py +5 -0
- parqv-0.3.0/src/parqv/data_sources/formats/csv.py +460 -0
- {parqv-0.2.1 → parqv-0.3.0}/src/parqv/data_sources/formats/json.py +37 -0
- {parqv-0.2.1 → parqv-0.3.0}/src/parqv/data_sources/formats/parquet.py +28 -1
- {parqv-0.2.1 → parqv-0.3.0}/src/parqv/views/utils/__init__.py +8 -2
- {parqv-0.2.1 → parqv-0.3.0}/src/parqv/views/utils/data_formatters.py +23 -1
- {parqv-0.2.1 → parqv-0.3.0}/src/parqv/views/utils/stats_formatters.py +78 -18
- parqv-0.3.0/src/parqv/views/utils/visualization.py +204 -0
- {parqv-0.2.1 → parqv-0.3.0/src/parqv.egg-info}/PKG-INFO +4 -5
- {parqv-0.2.1 → parqv-0.3.0}/src/parqv.egg-info/SOURCES.txt +3 -1
- {parqv-0.2.1 → parqv-0.3.0}/LICENSE +0 -0
- {parqv-0.2.1 → parqv-0.3.0}/setup.cfg +0 -0
- {parqv-0.2.1 → parqv-0.3.0}/src/parqv/__init__.py +0 -0
- {parqv-0.2.1 → parqv-0.3.0}/src/parqv/app.py +0 -0
- {parqv-0.2.1 → parqv-0.3.0}/src/parqv/cli.py +0 -0
- {parqv-0.2.1 → parqv-0.3.0}/src/parqv/core/__init__.py +0 -0
- {parqv-0.2.1 → parqv-0.3.0}/src/parqv/core/file_utils.py +0 -0
- {parqv-0.2.1 → parqv-0.3.0}/src/parqv/core/logging.py +0 -0
- {parqv-0.2.1 → parqv-0.3.0}/src/parqv/data_sources/base/__init__.py +0 -0
- {parqv-0.2.1 → parqv-0.3.0}/src/parqv/data_sources/base/exceptions.py +0 -0
- {parqv-0.2.1 → parqv-0.3.0}/src/parqv/data_sources/base/handler.py +0 -0
- {parqv-0.2.1 → parqv-0.3.0}/src/parqv/parqv.css +0 -0
- {parqv-0.2.1 → parqv-0.3.0}/src/parqv/views/__init__.py +0 -0
- {parqv-0.2.1 → parqv-0.3.0}/src/parqv/views/base.py +0 -0
- {parqv-0.2.1 → parqv-0.3.0}/src/parqv/views/components/__init__.py +0 -0
- {parqv-0.2.1 → parqv-0.3.0}/src/parqv/views/components/enhanced_data_table.py +0 -0
- {parqv-0.2.1 → parqv-0.3.0}/src/parqv/views/components/error_display.py +0 -0
- {parqv-0.2.1 → parqv-0.3.0}/src/parqv/views/components/loading_display.py +0 -0
- {parqv-0.2.1 → parqv-0.3.0}/src/parqv/views/data_view.py +0 -0
- {parqv-0.2.1 → parqv-0.3.0}/src/parqv/views/metadata_view.py +0 -0
- {parqv-0.2.1 → parqv-0.3.0}/src/parqv/views/schema_view.py +0 -0
- {parqv-0.2.1 → parqv-0.3.0}/src/parqv.egg-info/dependency_links.txt +0 -0
- {parqv-0.2.1 → parqv-0.3.0}/src/parqv.egg-info/entry_points.txt +0 -0
- {parqv-0.2.1 → parqv-0.3.0}/src/parqv.egg-info/requires.txt +0 -0
- {parqv-0.2.1 → parqv-0.3.0}/src/parqv.egg-info/top_level.txt +0 -0
--- parqv-0.2.1/src/parqv.egg-info/PKG-INFO
+++ parqv-0.3.0/PKG-INFO
@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: parqv
-Version: 0.2.1
+Version: 0.3.0
 Summary: An interactive Python TUI for visualizing, exploring, and analyzing files directly in your terminal.
 Author-email: Sangmin Yoon <sanspareilsmyn@gmail.com>
 License-Expression: Apache-2.0
@@ -23,14 +23,13 @@ Dynamic: license-file
 
 ---
 
-**Supported File Formats:** ✅ **Parquet** | ✅ **JSON** / **JSON Lines (ndjson)** | *(More planned!)*
+**Supported File Formats:** ✅ **Parquet** | ✅ **JSON** / **JSON Lines (ndjson)** | ✅ **CSV / TSV** | *(More planned!)*
 
 ---
 
-**`parqv` is a Python-based interactive TUI (Text User Interface) tool designed to explore, analyze, and understand various data file formats directly within your terminal.**
+**`parqv` is a Python-based interactive TUI (Text User Interface) tool designed to explore, analyze, and understand various data file formats directly within your terminal.** `parqv` aims to provide a unified, visual experience for quick data inspection without leaving your console.
 
 ## 💻 Demo
-
 
 *(Demo shows Parquet features; UI adapts for other formats)*
 
@@ -47,7 +46,7 @@ Dynamic: license-file
 * **🔌 Extensible:** Designed with a handler interface to easily add support for more file formats in the future (like CSV, Arrow IPC, etc.).
 
 ## ✨ Features (TUI Mode)
-* **Multi-Format Support:**
+* **Multi-Format Support:** Now supports **Parquet** (`.parquet`), **JSON/JSON Lines** (`.json`, `.ndjson`), and **CSV/TSV** (`.csv`, `.tsv`). Run `parqv <your_file.{parquet,json,ndjson,csv,tsv}>`.
 * **Metadata Panel:** Displays key file information (path, format, size, total rows, column count, etc.). *Fields may vary slightly depending on the file format.*
 * **Schema Explorer:**
   * Interactive list view of columns.
--- parqv-0.2.1/README.md
+++ parqv-0.3.0/README.md
@@ -7,14 +7,13 @@
 
 ---
 
-**Supported File Formats:** ✅ **Parquet** | ✅ **JSON** / **JSON Lines (ndjson)** | *(More planned!)*
+**Supported File Formats:** ✅ **Parquet** | ✅ **JSON** / **JSON Lines (ndjson)** | ✅ **CSV / TSV** | *(More planned!)*
 
 ---
 
-**`parqv` is a Python-based interactive TUI (Text User Interface) tool designed to explore, analyze, and understand various data file formats directly within your terminal.**
+**`parqv` is a Python-based interactive TUI (Text User Interface) tool designed to explore, analyze, and understand various data file formats directly within your terminal.** `parqv` aims to provide a unified, visual experience for quick data inspection without leaving your console.
 
 ## 💻 Demo
-
 
 *(Demo shows Parquet features; UI adapts for other formats)*
 
@@ -31,7 +30,7 @@
 * **🔌 Extensible:** Designed with a handler interface to easily add support for more file formats in the future (like CSV, Arrow IPC, etc.).
 
 ## ✨ Features (TUI Mode)
-* **Multi-Format Support:**
+* **Multi-Format Support:** Now supports **Parquet** (`.parquet`), **JSON/JSON Lines** (`.json`, `.ndjson`), and **CSV/TSV** (`.csv`, `.tsv`). Run `parqv <your_file.{parquet,json,ndjson,csv,tsv}>`.
 * **Metadata Panel:** Displays key file information (path, format, size, total rows, column count, etc.). *Fields may vary slightly depending on the file format.*
 * **Schema Explorer:**
   * Interactive list view of columns.
--- parqv-0.2.1/pyproject.toml
+++ parqv-0.3.0/pyproject.toml
@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"
 
 [project]
 name = "parqv"
-version = "0.2.1"
+version = "0.3.0"
 description = "An interactive Python TUI for visualizing, exploring, and analyzing files directly in your terminal."
 readme = "README.md"
 requires-python = ">=3.10"
--- parqv-0.2.1/src/parqv/core/handler_factory.py
+++ parqv-0.3.0/src/parqv/core/handler_factory.py
@@ -5,7 +5,7 @@ Handler factory for creating appropriate data handlers based on file type.
 from pathlib import Path
 from typing import Optional
 
-from ..data_sources import DataHandler, DataHandlerError, ParquetHandler, JsonHandler
+from ..data_sources import DataHandler, DataHandlerError, ParquetHandler, JsonHandler, CsvHandler
 from .logging import get_logger
 
 log = get_logger(__name__)
@@ -23,6 +23,7 @@ class HandlerFactory:
     _HANDLER_REGISTRY = {
         "parquet": ParquetHandler,
         "json": JsonHandler,
+        "csv": CsvHandler,
     }
 
     @classmethod
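For context, the registry above maps a normalized format key to a handler class, which is why CSV support is a one-line registration. A minimal sketch of how registry-based dispatch can work — the `resolve_handler` helper and the extension-to-format table below are hypothetical illustrations, not parqv's actual factory method; the tsv→csv mapping is an assumption based on the README's CSV/TSV claim:

from pathlib import Path

from parqv.core.handler_factory import HandlerFactory
from parqv.data_sources import DataHandler, DataHandlerError

# Assumed extension mapping; parqv's real detection logic is not shown in this diff.
_EXTENSION_TO_FORMAT = {"parquet": "parquet", "json": "json", "ndjson": "json",
                        "csv": "csv", "tsv": "csv"}

def resolve_handler(file_path: Path) -> DataHandler:
    # Normalize the extension, look up the format key, then instantiate the class.
    format_key = _EXTENSION_TO_FORMAT.get(file_path.suffix.lstrip(".").lower())
    handler_cls = HandlerFactory._HANDLER_REGISTRY.get(format_key)
    if handler_cls is None:
        raise DataHandlerError(f"Unsupported file format: {file_path.suffix}")
    return handler_cls(file_path)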
--- parqv-0.2.1/src/parqv/data_sources/__init__.py
+++ parqv-0.3.0/src/parqv/data_sources/__init__.py
@@ -23,6 +23,8 @@ from .formats import (
     ParquetHandlerError,
     JsonHandler,
     JsonHandlerError,
+    CsvHandler,
+    CsvHandlerError,
 )
 
 __all__ = [
@@ -41,4 +43,6 @@ __all__ = [
     "ParquetHandlerError",
     "JsonHandler",
     "JsonHandlerError",
+    "CsvHandler",
+    "CsvHandlerError",
 ]
--- parqv-0.2.1/src/parqv/data_sources/formats/__init__.py
+++ parqv-0.3.0/src/parqv/data_sources/formats/__init__.py
@@ -4,6 +4,7 @@ Format-specific data handlers for parqv.
 
 from .parquet import ParquetHandler, ParquetHandlerError
 from .json import JsonHandler, JsonHandlerError
+from .csv import CsvHandler, CsvHandlerError
 
 __all__ = [
     # Parquet format
@@ -13,4 +14,8 @@ __all__ = [
     # JSON format
     "JsonHandler",
     "JsonHandlerError",
+
+    # CSV format
+    "CsvHandler",
+    "CsvHandlerError",
 ]
--- /dev/null
+++ parqv-0.3.0/src/parqv/data_sources/formats/csv.py
@@ -0,0 +1,460 @@
+"""
+CSV file handler for parqv data sources.
+"""
+
+from pathlib import Path
+from typing import Any, Dict, List, Optional
+
+import pandas as pd
+
+from ..base import DataHandler, DataHandlerError
+
+
+class CsvHandlerError(DataHandlerError):
+    """Custom exception for CSV handling errors."""
+    pass
+
+
+class CsvHandler(DataHandler):
+    """
+    Handles CSV file interactions using pandas.
+
+    Provides methods to access metadata, schema, data preview, and column statistics
+    for CSV files using pandas DataFrame operations.
+    """
+
+    def __init__(self, file_path: Path):
+        """
+        Initialize the CsvHandler by validating the path and reading the CSV file.
+
+        Args:
+            file_path: Path to the CSV file.
+
+        Raises:
+            CsvHandlerError: If the file is not found, not a file, or cannot be read.
+        """
+        super().__init__(file_path)
+        self.df: Optional[pd.DataFrame] = None
+        self._original_dtypes: Optional[Dict[str, str]] = None
+
+        try:
+            # Validate file existence
+            if not self.file_path.is_file():
+                raise FileNotFoundError(f"CSV file not found or is not a regular file: {self.file_path}")
+
+            # Read the CSV file with pandas
+            self._read_csv_file()
+
+            self.logger.info(f"Successfully initialized CsvHandler for: {self.file_path.name}")
+
+        except FileNotFoundError as fnf_e:
+            self.logger.error(f"File not found during CsvHandler initialization: {fnf_e}")
+            raise CsvHandlerError(str(fnf_e)) from fnf_e
+        except pd.errors.EmptyDataError as empty_e:
+            self.logger.error(f"CSV file is empty: {empty_e}")
+            raise CsvHandlerError(f"CSV file '{self.file_path.name}' is empty") from empty_e
+        except pd.errors.ParserError as parse_e:
+            self.logger.error(f"CSV parsing error: {parse_e}")
+            raise CsvHandlerError(f"Failed to parse CSV file '{self.file_path.name}': {parse_e}") from parse_e
+        except Exception as e:
+            self.logger.exception(f"Unexpected error initializing CsvHandler for {self.file_path.name}")
+            raise CsvHandlerError(f"Failed to initialize CSV handler '{self.file_path.name}': {e}") from e
+
+    def _read_csv_file(self) -> None:
+        """Read the CSV file using pandas with appropriate settings."""
+        try:
+            # Read CSV with automatic type inference
+            self.df = pd.read_csv(
+                self.file_path,
+                # Basic settings
+                encoding='utf-8',
+                # Handle various separators automatically
+                sep=None,  # Let pandas auto-detect
+                engine='python',  # More flexible parsing
+                # Preserve original string representation for better type info
+                dtype=str,  # Read everything as string first
+                na_values=['', 'NULL', 'null', 'None', 'N/A', 'n/a', 'NaN', 'nan'],
+                keep_default_na=True,
+            )
+
+            # Store original dtypes before conversion
+            self._original_dtypes = {col: 'string' for col in self.df.columns}
+
+            # Try to infer better types
+            self._infer_types()
+
+            self.logger.debug(f"Successfully read CSV with shape: {self.df.shape}")
+
+        except UnicodeDecodeError:
+            # Try with different encodings
+            for encoding in ['latin1', 'cp1252', 'iso-8859-1']:
+                try:
+                    self.logger.warning(f"Trying encoding: {encoding}")
+                    self.df = pd.read_csv(
+                        self.file_path,
+                        encoding=encoding,
+                        sep=None,
+                        engine='python',
+                        dtype=str,
+                        na_values=['', 'NULL', 'null', 'None', 'N/A', 'n/a', 'NaN', 'nan'],
+                        keep_default_na=True,
+                    )
+                    self._original_dtypes = {col: 'string' for col in self.df.columns}
+                    self._infer_types()
+                    self.logger.info(f"Successfully read CSV with encoding: {encoding}")
+                    break
+                except UnicodeDecodeError:
+                    continue
+            else:
+                raise CsvHandlerError(f"Could not decode CSV file with any common encoding")
+
+    def _infer_types(self) -> None:
+        """Infer appropriate data types for columns."""
+        if self.df is None:
+            return
+
+        for col in self.df.columns:
+            # Try to convert to numeric
+            numeric_converted = pd.to_numeric(self.df[col], errors='coerce')
+            if not numeric_converted.isna().all():
+                # If most values can be converted to numeric, use numeric type
+                non_na_original = self.df[col].notna().sum()
+                non_na_converted = numeric_converted.notna().sum()
+
+                if non_na_converted / max(non_na_original, 1) > 0.8:  # 80% conversion success
+                    self.df[col] = numeric_converted
+                    if (numeric_converted == numeric_converted.astype('Int64', errors='ignore')).all():
+                        self._original_dtypes[col] = 'integer'
+                    else:
+                        self._original_dtypes[col] = 'float'
+                    continue
+
+            # Try to convert to datetime
+            try:
+                datetime_converted = pd.to_datetime(self.df[col], errors='coerce', infer_datetime_format=True)
+                if not datetime_converted.isna().all():
+                    non_na_original = self.df[col].notna().sum()
+                    non_na_converted = datetime_converted.notna().sum()
+
+                    if non_na_converted / max(non_na_original, 1) > 0.8:  # 80% conversion success
+                        self.df[col] = datetime_converted
+                        self._original_dtypes[col] = 'datetime'
+                        continue
+            except (ValueError, TypeError):
+                pass
+
+            # Try to convert to boolean
+            bool_values = self.df[col].str.lower().isin(['true', 'false', 't', 'f', '1', '0', 'yes', 'no', 'y', 'n'])
+            if bool_values.sum() / len(self.df[col]) > 0.8:
+                bool_mapping = {
+                    'true': True, 'false': False, 't': True, 'f': False,
+                    '1': True, '0': False, 'yes': True, 'no': False,
+                    'y': True, 'n': False
+                }
+                self.df[col] = self.df[col].str.lower().map(bool_mapping)
+                self._original_dtypes[col] = 'boolean'
+                continue
+
+            # Keep as string
+            self._original_dtypes[col] = 'string'
+
+    def close(self) -> None:
+        """Close and cleanup resources (CSV data is held in memory)."""
+        if self.df is not None:
+            self.logger.info(f"Closed CSV handler for: {self.file_path.name}")
+        self.df = None
+        self._original_dtypes = None
+
+    def get_metadata_summary(self) -> Dict[str, Any]:
+        """
+        Get a summary dictionary of the CSV file's metadata.
+
+        Returns:
+            A dictionary containing metadata like file path, format, row count, columns, size.
+        """
+        if self.df is None:
+            return {"error": "CSV data not loaded or handler closed."}
+
+        try:
+            file_size = self.file_path.stat().st_size
+            size_str = self.format_size(file_size)
+        except Exception as e:
+            self.logger.warning(f"Could not get file size for {self.file_path}: {e}")
+            size_str = "N/A"
+
+        # Create a well-structured metadata summary
+        summary = {
+            "File Information": {
+                "Path": str(self.file_path),
+                "Format": "CSV",
+                "Size": size_str
+            },
+            "Data Structure": {
+                "Total Rows": f"{len(self.df):,}",
+                "Total Columns": f"{len(self.df.columns):,}",
+                "Memory Usage": f"{self.df.memory_usage(deep=True).sum():,} bytes"
+            },
+            "Column Types Summary": self._get_column_types_summary()
+        }
+
+        return summary
+
+    def _get_column_types_summary(self) -> Dict[str, int]:
+        """Get a summary of column types in the CSV data."""
+        if self.df is None or self._original_dtypes is None:
+            return {}
+
+        type_counts = {}
+        for col_type in self._original_dtypes.values():
+            type_counts[col_type] = type_counts.get(col_type, 0) + 1
+
+        # Format for better display
+        formatted_summary = {}
+        type_labels = {
+            'string': 'Text Columns',
+            'integer': 'Integer Columns',
+            'float': 'Numeric Columns',
+            'datetime': 'Date/Time Columns',
+            'boolean': 'Boolean Columns'
+        }
+
+        for type_key, count in type_counts.items():
+            label = type_labels.get(type_key, f'{type_key.title()} Columns')
+            formatted_summary[label] = f"{count:,}"
+
+        return formatted_summary
+
+    def get_schema_data(self) -> Optional[List[Dict[str, Any]]]:
+        """
+        Get the schema of the CSV data.
+
+        Returns:
+            A list of dictionaries describing columns (name, type, nullable),
+            or None if schema couldn't be determined.
+        """
+        if self.df is None:
+            self.logger.warning("DataFrame is not available for schema data")
+            return None
+
+        schema_list = []
+
+        for col in self.df.columns:
+            try:
+                # Get the inferred type
+                col_type = self._original_dtypes.get(col, 'string')
+
+                # Check for null values
+                has_nulls = self.df[col].isna().any()
+
+                schema_list.append({
+                    "name": str(col),
+                    "type": col_type,
+                    "nullable": bool(has_nulls)
+                })
+
+            except Exception as e:
+                self.logger.error(f"Error processing column '{col}' for schema data: {e}")
+                schema_list.append({
+                    "name": str(col),
+                    "type": f"[Error: {e}]",
+                    "nullable": None
+                })
+
+        return schema_list
+
+    def get_data_preview(self, num_rows: int = 50) -> Optional[pd.DataFrame]:
+        """
+        Fetch a preview of the data.
+
+        Args:
+            num_rows: The maximum number of rows to fetch.
+
+        Returns:
+            A pandas DataFrame with preview data, an empty DataFrame if no data,
+            or a DataFrame with an 'error' column on failure.
+        """
+        if self.df is None:
+            self.logger.warning("CSV data not available for preview")
+            return pd.DataFrame({"error": ["CSV data not loaded or handler closed."]})
+
+        try:
+            if self.df.empty:
+                self.logger.info("CSV file has no data rows")
+                return pd.DataFrame(columns=self.df.columns)
+
+            # Return first num_rows
+            preview_df = self.df.head(num_rows).copy()
+            self.logger.info(f"Generated preview of {len(preview_df)} rows for {self.file_path.name}")
+            return preview_df
+
+        except Exception as e:
+            self.logger.exception(f"Error generating data preview from CSV file: {self.file_path.name}")
+            return pd.DataFrame({"error": [f"Failed to generate preview: {e}"]})
+
+    def get_column_stats(self, column_name: str) -> Dict[str, Any]:
+        """
+        Calculate and return statistics for a specific column.
+
+        Args:
+            column_name: The name of the column.
+
+        Returns:
+            A dictionary containing column statistics or error information.
+        """
+        if self.df is None:
+            return self._create_stats_result(
+                column_name, "Unknown", {}, error="CSV data not loaded or handler closed."
+            )
+
+        if column_name not in self.df.columns:
+            return self._create_stats_result(
+                column_name, "Unknown", {}, error=f"Column '{column_name}' not found in CSV data."
+            )
+
+        try:
+            col_series = self.df[column_name]
+            col_type = self._original_dtypes.get(column_name, 'string')
+
+            # Basic counts
+            total_count = len(col_series)
+            null_count = col_series.isna().sum()
+            valid_count = total_count - null_count
+            null_percentage = (null_count / total_count * 100) if total_count > 0 else 0
+
+            stats = {
+                "Total Count": f"{total_count:,}",
+                "Valid Count": f"{valid_count:,}",
+                "Null Count": f"{null_count:,}",
+                "Null Percentage": f"{null_percentage:.2f}%"
+            }
+
+            # Type-specific statistics
+            if valid_count > 0:
+                valid_series = col_series.dropna()
+
+                # Distinct count (always applicable)
+                distinct_count = valid_series.nunique()
+                stats["Distinct Count"] = f"{distinct_count:,}"
+
+                if col_type in ['integer', 'float']:
+                    # Numeric statistics
+                    stats.update(self._calculate_numeric_stats_pandas(valid_series))
+                elif col_type == 'datetime':
+                    # Datetime statistics
+                    stats.update(self._calculate_datetime_stats_pandas(valid_series))
+                elif col_type == 'boolean':
+                    # Boolean statistics
+                    stats.update(self._calculate_boolean_stats_pandas(valid_series))
+                elif col_type == 'string':
+                    # String statistics (min/max by alphabetical order)
+                    stats.update(self._calculate_string_stats_pandas(valid_series))
+
+            return self._create_stats_result(column_name, col_type, stats, nullable=null_count > 0)
+
+        except Exception as e:
+            self.logger.exception(f"Error calculating stats for column '{column_name}'")
+            return self._create_stats_result(
+                column_name, "Unknown", {}, error=f"Failed to calculate statistics: {e}"
+            )
+
+    def _calculate_numeric_stats_pandas(self, series: pd.Series) -> Dict[str, Any]:
+        """Calculate statistics for numeric columns using pandas."""
+        stats = {}
+        try:
+            stats["Min"] = series.min()
+            stats["Max"] = series.max()
+            stats["Mean"] = f"{series.mean():.4f}"
+            stats["Median (50%)"] = series.median()
+            stats["StdDev"] = f"{series.std():.4f}"
+
+            # Add histogram data for visualization
+            try:
+                # Sample data if too large for performance
+                sample_size = min(10000, len(series))
+                if len(series) > sample_size:
+                    sampled_series = series.sample(n=sample_size, random_state=42)
+                else:
+                    sampled_series = series
+
+                # Convert to list for histogram
+                clean_data = sampled_series.tolist()
+
+                if len(clean_data) > 10:  # Only create histogram if we have enough data
+                    stats["_histogram_data"] = clean_data
+                    stats["_data_type"] = "numeric"
+
+            except Exception as e:
+                self.logger.warning(f"Failed to prepare histogram data: {e}")
+
+        except Exception as e:
+            self.logger.warning(f"Error calculating numeric stats: {e}")
+            stats["Calculation Error"] = str(e)
+        return stats
+
+    def _calculate_datetime_stats_pandas(self, series: pd.Series) -> Dict[str, Any]:
+        """Calculate statistics for datetime columns using pandas."""
+        stats = {}
+        try:
+            stats["Min"] = series.min()
+            stats["Max"] = series.max()
+            # Calculate time range
+            time_range = series.max() - series.min()
+            stats["Range"] = str(time_range)
+        except Exception as e:
+            self.logger.warning(f"Error calculating datetime stats: {e}")
+            stats["Calculation Error"] = str(e)
+        return stats
+
+    def _calculate_boolean_stats_pandas(self, series: pd.Series) -> Dict[str, Any]:
+        """Calculate statistics for boolean columns using pandas."""
+        stats = {}
+        try:
+            value_counts = series.value_counts()
+            stats["True Count"] = f"{value_counts.get(True, 0):,}"
+            stats["False Count"] = f"{value_counts.get(False, 0):,}"
+            if len(value_counts) > 0:
+                true_pct = (value_counts.get(True, 0) / len(series) * 100)
+                stats["True Percentage"] = f"{true_pct:.2f}%"
+        except Exception as e:
+            self.logger.warning(f"Error calculating boolean stats: {e}")
+            stats["Calculation Error"] = str(e)
+        return stats
+
+    def _calculate_string_stats_pandas(self, series: pd.Series) -> Dict[str, Any]:
+        """Calculate statistics for string columns using pandas."""
+        stats = {}
+        try:
+            # Only min/max for strings (alphabetical order)
+            stats["Min"] = str(series.min())
+            stats["Max"] = str(series.max())
+
+            # Most common values
+            value_counts = series.value_counts().head(5)
+            if len(value_counts) > 0:
+                top_values = {}
+                for value, count in value_counts.items():
+                    top_values[str(value)] = f"{count:,}"
+                stats["Top Values"] = top_values
+        except Exception as e:
+            self.logger.warning(f"Error calculating string stats: {e}")
+            stats["Calculation Error"] = str(e)
+        return stats
+
+    def _create_stats_result(
+            self,
+            column_name: str,
+            col_type: str,
+            calculated_stats: Dict[str, Any],
+            nullable: Optional[bool] = None,
+            error: Optional[str] = None,
+            message: Optional[str] = None
+    ) -> Dict[str, Any]:
+        """Package the stats results consistently."""
+        return {
+            "column": column_name,
+            "type": col_type,
+            "nullable": nullable if nullable is not None else "Unknown",
+            "calculated": calculated_stats or {},
+            "error": error,
+            "message": message,
+        }
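A quick sketch of what the new handler's surface looks like when driven directly — inside parqv these calls are made by the TUI through HandlerFactory, and "events.csv" and "user_id" below are made-up example names:

from pathlib import Path

from parqv.data_sources import CsvHandler

handler = CsvHandler(Path("events.csv"))       # hypothetical input file
print(handler.get_metadata_summary())          # grouped File Information / Data Structure / type summary
print(handler.get_schema_data())               # [{"name": ..., "type": ..., "nullable": ...}, ...]
print(handler.get_data_preview(num_rows=10))   # pandas DataFrame with the first rows
print(handler.get_column_stats("user_id"))     # counts, null percentage, type-specific stats
handler.close()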
--- parqv-0.2.1/src/parqv/data_sources/formats/json.py
+++ parqv-0.3.0/src/parqv/data_sources/formats/json.py
@@ -303,6 +303,12 @@ class JsonHandler(DataHandler):
             # SUMMARIZE puts results in the first row
             stats = self._format_summarize_stats(summarize_df.iloc[0])
 
+            # Add histogram data for numeric columns
+            try:
+                self._add_histogram_data_if_numeric(stats, safe_column_name)
+            except Exception as hist_e:
+                self.logger.warning(f"Failed to add histogram data for {column_name}: {hist_e}")
+
         except duckdb.Error as db_err:
             self.logger.exception(f"DuckDB Error calculating statistics for column '{column_name}': {db_err}")
             error_msg = f"DuckDB calculation failed: {db_err}"
@@ -403,6 +409,37 @@ class JsonHandler(DataHandler):
 
         return stats
 
+    def _add_histogram_data_if_numeric(self, stats: Dict[str, Any], safe_column_name: str) -> None:
+        """Add histogram data for numeric columns by sampling from DuckDB."""
+        # Check if this looks like numeric data (has Mean, Min, Max)
+        if not all(key in stats for key in ["Mean", "Min", "Max"]):
+            return
+
+        try:
+            # Sample data for histogram (limit to 10k samples for performance)
+            sample_query = f"""
+                SELECT {safe_column_name}
+                FROM "{self._view_name}"
+                WHERE {safe_column_name} IS NOT NULL
+                USING SAMPLE 10000
+            """
+
+            sample_df = self._db_conn.sql(sample_query).df()
+
+            if not sample_df.empty and len(sample_df) > 10:
+                # Extract the column data
+                column_data = sample_df.iloc[:, 0].tolist()
+
+                # Filter out any remaining nulls
+                clean_data = [val for val in column_data if val is not None]
+
+                if len(clean_data) > 10:
+                    stats["_histogram_data"] = clean_data
+                    stats["_data_type"] = "numeric"
+
+        except Exception as e:
+            self.logger.warning(f"Failed to sample data for histogram: {e}")
+
     def _create_stats_result(
         self,
         column_name: str,
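The `USING SAMPLE 10000` clause in the query above is DuckDB's built-in sampling syntax, which bounds how much data leaves the database for the histogram regardless of source size. A standalone sketch of the same idea, using a synthetic table rather than parqv's JSON view:

import duckdb  # assumes the duckdb package is installed

con = duckdb.connect()
# Sampling happens inside DuckDB, so only ~10k rows cross into pandas.
df = con.sql("SELECT * FROM range(1000000) USING SAMPLE 10000").df()
print(len(df))  # 10000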
--- parqv-0.2.1/src/parqv/data_sources/formats/parquet.py
+++ parqv-0.3.0/src/parqv/data_sources/formats/parquet.py
@@ -301,7 +301,8 @@ class ParquetHandler(DataHandler):
                 message = f"Statistics calculation not implemented for type '{self._format_pyarrow_type(col_type)}'."
 
         except Exception as calc_err:
-            self.logger.exception(f"Error during type-specific calculation for column '{column_name}': {calc_err}")
+            self.logger.exception(
+                f"Error during type-specific calculation for column '{column_name}': {calc_err}")
             error_msg = f"Calculation error for type {field.type}: {calc_err}"
             calculated_stats["Calculation Error"] = str(calc_err)  # Add specific error key
 
@@ -422,6 +423,32 @@ class ParquetHandler(DataHandler):
         variance_val, err_var = self._safe_compute(pc.variance, column_data, ddof=1)
         stats["Variance"] = f"{variance_val:.4f}" if variance_val is not None and err_var is None else (
             err_var or "N/A")
+        distinct_val, err = self._safe_compute(pc.count_distinct, column_data)
+        stats["Distinct Count"] = f"{distinct_val:,}" if distinct_val is not None and err is None else (err or "N/A")
+
+        # Add histogram data for visualization
+        try:
+            # Convert to Python list for histogram calculation (sample if too large)
+            data_length = len(column_data)
+            sample_size = min(10000, data_length)  # Limit to 10k samples for performance
+
+            if data_length > sample_size:
+                # Sample the data
+                import random
+                indices = sorted(random.sample(range(data_length), sample_size))
+                sampled_data = [column_data[i].as_py() for i in indices]
+            else:
+                sampled_data = column_data.to_pylist()
+
+            # Filter out None values
+            clean_data = [val for val in sampled_data if val is not None]
+
+            if len(clean_data) > 10:  # Only create histogram if we have enough data
+                stats["_histogram_data"] = clean_data
+                stats["_data_type"] = "numeric"
+
+        except Exception as e:
+            self.logger.warning(f"Failed to prepare histogram data: {e}")
 
         return stats
 
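The sampling strategy here indexes into the pyarrow array rather than converting it wholesale, so only the sampled values are materialized in Python via `.as_py()`. A standalone sketch of that approach on a synthetic array:

import random
import pyarrow as pa

arr = pa.chunked_array([list(range(100_000))])
indices = sorted(random.sample(range(len(arr)), 10_000))
sampled = [arr[i].as_py() for i in indices]  # ChunkedArray supports scalar indexing
print(len(sampled))  # 10000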
--- parqv-0.2.1/src/parqv/views/utils/__init__.py
+++ parqv-0.3.0/src/parqv/views/utils/__init__.py
@@ -4,10 +4,16 @@ Utility functions for parqv views.
 
 from .data_formatters import format_metadata_for_display, format_value_for_display
 from .stats_formatters import format_stats_for_display, format_column_info
+from .visualization import create_text_histogram, should_show_histogram
 
 __all__ = [
+    # Data formatting
     "format_metadata_for_display",
-    "format_value_for_display",
+    "format_value_for_display",
     "format_stats_for_display",
     "format_column_info",
-]
+
+    # Visualization
+    "create_text_histogram",
+    "should_show_histogram",
+]
--- parqv-0.2.1/src/parqv/views/utils/data_formatters.py
+++ parqv-0.3.0/src/parqv/views/utils/data_formatters.py
@@ -28,15 +28,21 @@ def format_metadata_for_display(metadata: Dict[str, Any]) -> Dict[str, Any]:
     # Format specific known fields with better presentation
     field_formatters = {
         "File Path": lambda x: str(x),
+        "Path": lambda x: str(x),
         "Format": lambda x: str(x).upper(),
         "Total Rows": lambda x: _format_number(x),
+        "Total Columns": lambda x: _format_number(x),
         "Columns": lambda x: _format_number(x),
         "Size": lambda x: _format_size_if_bytes(x),
+        "Memory Usage": lambda x: _format_size_if_bytes(x),
         "DuckDB View": lambda x: f"`{x}`" if x else "N/A",
     }
 
     for key, value in metadata.items():
-        if key in field_formatters:
+        if isinstance(value, dict):
+            # Handle nested dictionaries (like grouped metadata)
+            formatted[key] = _format_nested_metadata(value, field_formatters)
+        elif key in field_formatters:
             formatted[key] = field_formatters[key](value)
         else:
             formatted[key] = format_value_for_display(value)
@@ -44,6 +50,22 @@ def format_metadata_for_display(metadata: Dict[str, Any]) -> Dict[str, Any]:
     return formatted
 
 
+def _format_nested_metadata(nested_dict: Dict[str, Any], field_formatters: Dict) -> Dict[str, Any]:
+    """Format nested metadata dictionaries."""
+    formatted_nested = {}
+
+    for key, value in nested_dict.items():
+        if isinstance(value, dict):
+            # Handle further nesting if needed
+            formatted_nested[key] = _format_nested_metadata(value, field_formatters)
+        elif key in field_formatters:
+            formatted_nested[key] = field_formatters[key](value)
+        else:
+            formatted_nested[key] = format_value_for_display(value)
+
+    return formatted_nested
+
+
 def format_value_for_display(value: Any) -> str:
     """
     Format a single value for display in the UI.
--- parqv-0.2.1/src/parqv/views/utils/stats_formatters.py
+++ parqv-0.3.0/src/parqv/views/utils/stats_formatters.py
@@ -6,6 +6,8 @@ from typing import Any, Dict, List, Union
 
 from rich.text import Text
 
+from .visualization import create_text_histogram, should_show_histogram
+
 
 def format_stats_for_display(stats_data: Dict[str, Any]) -> List[Union[str, Text]]:
     """
@@ -21,7 +23,7 @@ def format_stats_for_display(stats_data: Dict[str, Any]) -> List[Union[str, Text
         return [Text.from_markup("[red]No statistics data available.[/red]")]
 
     lines: List[Union[str, Text]] = []
-
+
     # Extract basic column information
     col_name = stats_data.get("column", "N/A")
     col_type = stats_data.get("type", "Unknown")
@@ -29,22 +31,22 @@ def format_stats_for_display(stats_data: Dict[str, Any]) -> List[Union[str, Text
 
     # Format column header
     lines.extend(_format_column_header(col_name, col_type, nullable_val))
-
+
     # Handle calculation errors
     calc_error = stats_data.get("error")
     if calc_error:
         lines.extend(_format_error_section(calc_error))
-
+
     # Add informational messages
     message = stats_data.get("message")
     if message:
         lines.extend(_format_message_section(message))
-
+
     # Format calculated statistics
     calculated = stats_data.get("calculated")
     if calculated:
         lines.extend(_format_calculated_stats(calculated, has_error=bool(calc_error)))
-
+
     return lines
 
 
@@ -72,13 +74,13 @@ def _format_column_header(col_name: str, col_type: str, nullable_val: Any) -> Li
         nullable_str = "Required"
     else:
         nullable_str = "Unknown Nullability"
-
+
     lines = [
         Text.assemble(("Column: ", "bold"), f"`{col_name}`"),
         Text.assemble(("Type: ", "bold"), f"{col_type} ({nullable_str})"),
         "─" * (len(col_name) + len(col_type) + 20)
    ]
-
+
     return lines
 
 
@@ -102,7 +104,7 @@ def _format_message_section(message: str) -> List[Union[str, Text]]:
 def _format_calculated_stats(calculated: Dict[str, Any], has_error: bool = False) -> List[Union[str, Text]]:
     """Format the calculated statistics section."""
     lines = [Text("Calculated Statistics:", style="bold")]
-
+
     # Define the order of statistics to display
     stats_order = [
         "Total Count", "Valid Count", "Null Count", "Null Percentage",
@@ -111,32 +113,37 @@ def _format_calculated_stats(calculated: Dict[str, Any], has_error: bool = False
         "True Count", "False Count",
         "Value Counts"
     ]
-
+
     found_stats = False
-
+
     for key in stats_order:
         if key in calculated:
             found_stats = True
             value = calculated[key]
             lines.extend(_format_single_stat(key, value))
-
-    # Add any additional stats not in the predefined order
+
+    # Add any additional stats not in the predefined order (excluding internal histogram data)
     for key, value in calculated.items():
-        if key not in stats_order:
+        if key not in stats_order and not key.startswith('_'):  # Skip internal fields
             found_stats = True
             lines.extend(_format_single_stat(key, value))
-
+
     # Handle case where no stats were found
     if not found_stats and not has_error:
         lines.append(Text(" (No specific stats calculated for this type)", style="dim"))
-
+
+    # Add histogram visualization for numeric data
+    if "_histogram_data" in calculated and "_data_type" in calculated:
+        if calculated["_data_type"] == "numeric":
+            lines.extend(_format_histogram_visualization(calculated))
+
     return lines
 
 
 def _format_single_stat(key: str, value: Any) -> List[Union[str, Text]]:
     """Format a single statistic entry."""
     lines = []
-
+
     if key == "Value Counts" and isinstance(value, dict):
         lines.append(f"  - {key}:")
         for sub_key, sub_val in value.items():
@@ -145,7 +152,7 @@ def _format_single_stat(key: str, value: Any) -> List[Union[str, Text]]:
     else:
         formatted_value = _format_stat_value(value)
         lines.append(f"  - {key}: {formatted_value}")
-
+
     return lines
 
 
@@ -157,4 +164,57 @@ def _format_stat_value(value: Any) -> str:
         else:
             return f"{value:,.4f}"
     else:
-        return str(value)
+        return str(value)
+
+
+def _format_histogram_visualization(calculated: Dict[str, Any]) -> List[Union[str, Text]]:
+    """Format histogram visualization for numeric data."""
+    lines = []
+
+    try:
+        histogram_data = calculated.get("_histogram_data", [])
+        if not histogram_data:
+            return lines
+
+        # Check if we should show histogram
+        distinct_count_str = calculated.get("Distinct Count", "0")
+        try:
+            # Remove commas and convert to int
+            distinct_count = int(distinct_count_str.replace(",", ""))
+        except (ValueError, AttributeError):
+            distinct_count = len(set(histogram_data))
+
+        total_count = len(histogram_data)
+
+        if should_show_histogram("numeric", distinct_count, total_count):
+            lines.append("")
+            lines.append(Text("Data Distribution:", style="bold cyan"))
+
+            # Create histogram
+            histogram_lines = create_text_histogram(
+                data=histogram_data,
+                bins=15,
+                width=50,
+                height=8,
+                title=None
+            )
+
+            # Add each histogram line
+            for line in histogram_lines:
+                if isinstance(line, str):
+                    lines.append(f"  {line}")
+                else:
+                    lines.append(line)
+        else:
+            # For discrete data, show a note
+            if distinct_count < total_count * 0.1:  # Less than 10% unique values
+                lines.append("")
+                lines.append(Text("Note: Data appears to be discrete/categorical", style="dim italic"))
+                lines.append(Text("(Histogram not shown for discrete values)", style="dim italic"))
+
+    except Exception as e:
+        # Don't fail the whole stats display if histogram fails
+        lines.append("")
+        lines.append(Text(f"Note: Could not generate histogram: {e}", style="dim red"))
+
+    return lines
--- /dev/null
+++ parqv-0.3.0/src/parqv/views/utils/visualization.py
@@ -0,0 +1,204 @@
+"""
+Visualization utilities for parqv views.
+
+Provides text-based data visualization functions like ASCII histograms.
+"""
+import math
+from typing import List, Union, Optional
+
+TICK_CHARS = [' ', '▂', '▃', '▄', '▅', '▆', '▇', '█']
+
+
+def create_text_histogram(
+    data: List[Union[int, float]],
+    bins: int = 15,
+    width: int = 60,
+    height: int = 8,
+    title: Optional[str] = None
+) -> List[str]:
+    """
+    Create a professional, text-based histogram from numerical data.
+
+    Args:
+        data: List of numerical values.
+        bins: The number of bins for the histogram.
+        width: The total character width of the output histogram.
+        height: The maximum height of the histogram bars in lines.
+        title: An optional title for the histogram.
+
+    Returns:
+        A list of strings representing the histogram, ready for printing.
+    """
+    if not data:
+        return ["(No data available for histogram)"]
+
+    # 1. Sanitize the input data
+    clean_data = [float(val) for val in data if isinstance(val, (int, float)) and math.isfinite(val)]
+
+    if not clean_data:
+        return ["(No valid numerical data to plot)"]
+
+    min_val, max_val = min(clean_data), max(clean_data)
+
+    if min_val == max_val:
+        return [f"(All values are identical: {_format_number(min_val)})"]
+
+    # 2. Create bins and count frequencies
+    # Add a small epsilon to the range to ensure max_val falls into the last bin
+    epsilon = (max_val - min_val) / 1e9
+    value_range = (max_val - min_val) + epsilon
+    bin_width = value_range / bins
+
+    bin_counts = [0] * bins
+    for value in clean_data:
+        bin_index = int((value - min_val) / bin_width)
+        bin_counts[bin_index] += 1
+
+    # 3. Render the histogram
+    return _render_histogram(
+        bin_counts=bin_counts,
+        min_val=min_val,
+        max_val=max_val,
+        width=width,
+        height=height,
+        title=title
+    )
+
+
+def _render_histogram(
+    bin_counts: List[int],
+    min_val: float,
+    max_val: float,
+    width: int,
+    height: int,
+    title: Optional[str]
+) -> List[str]:
+    """
+    Internal function to render the histogram components into ASCII art.
+    """
+    lines = []
+    if title:
+        lines.append(title.center(width))
+
+    max_count = max(bin_counts) if bin_counts else 0
+    if max_count == 0:
+        return lines + ["(No data falls within histogram bins)"]
+
+    # --- Layout Calculations ---
+    y_axis_width = len(str(max_count))
+    plot_width = width - y_axis_width - 3  # Reserve space for "| " and axis
+    if plot_width <= 0:
+        return ["(Terminal width too narrow to draw histogram)"]
+
+    # Resample the data bins to fit the available plot_width.
+    # This stretches or shrinks the histogram to match the screen space.
+    display_bins = []
+    num_data_bins = len(bin_counts)
+    for i in range(plot_width):
+        # Find the corresponding data bin for this screen column
+        data_bin_index = int(i * num_data_bins / plot_width)
+        display_bins.append(bin_counts[data_bin_index])
+
+    # --- Y-Axis and Bars (Top to Bottom) ---
+    for row in range(height, -1, -1):
+        line = ""
+        # Y-axis labels
+        if row == height:
+            line += f"{max_count:<{y_axis_width}} | "
+        elif row == 0:
+            line += f"{0:<{y_axis_width}} +-"
+        else:
+            line += " " * y_axis_width + " | "
+
+        # Bars - now iterate over the resampled display_bins
+        for count in display_bins:
+            # Scale current count to the available height
+            scaled_height = (count / max_count) * height
+
+            # Determine character based on height relative to current row
+            if scaled_height >= row:
+                line += TICK_CHARS[-1]  # Full block for the solid part of the bar
+            elif scaled_height > row - 1:
+                # This is the top of the bar, use a partial character
+                partial_index = int((scaled_height - row + 1) * (len(TICK_CHARS) - 1))
+                line += TICK_CHARS[max(0, partial_index)]
+            elif row == 0:
+                line += "-"  # X-axis line
+            else:
+                line += " "  # Empty space above the bar
+
+        lines.append(line)
+
+    # --- X-Axis Labels ---
+    x_axis_labels = _create_x_axis_labels(min_val, max_val, plot_width)
+    label_line = " " * (y_axis_width + 3) + x_axis_labels
+    lines.append(label_line)
+
+    return lines
+
+
+def _create_x_axis_labels(min_val: float, max_val: float, plot_width: int) -> str:
+    """Create a formatted string for the X-axis labels."""
+    min_label = _format_number(min_val)
+    max_label = _format_number(max_val)
+
+    available_width = plot_width - len(min_label) - len(max_label)
+
+    if available_width < 4:
+        return f"{min_label}{' ' * (plot_width - len(min_label) - len(max_label))}{max_label}"
+
+    mid_val = (min_val + max_val) / 2
+    mid_label = _format_number(mid_val)
+
+    spacing1 = (plot_width // 2) - len(min_label) - (len(mid_label) // 2)
+    spacing2 = (plot_width - (plot_width // 2)) - (len(mid_label) - (len(mid_label) // 2)) - len(max_label)
+
+    if spacing1 < 1 or spacing2 < 1:
+        return f"{min_label}{' ' * (plot_width - len(min_label) - len(max_label))}{max_label}"
+
+    return f"{min_label}{' ' * spacing1}{mid_label}{' ' * spacing2}{max_label}"
+
+
+def _format_number(value: float) -> str:
+    """Format a number nicely for display on an axis."""
+    if abs(value) < 1e-4 and value != 0:
+        return f"{value:.1e}"
+    if abs(value) >= 1e5:
+        return f"{value:.1e}"
+    if math.isclose(value, int(value)):
+        return str(int(value))
+    if abs(value) < 10:
+        return f"{value:.2f}"
+    if abs(value) < 100:
+        return f"{value:.1f}"
+    return str(int(value))
+
+
+def should_show_histogram(data_type: str, distinct_count: int, total_count: int) -> bool:
+    """
+    Determine if a histogram should be shown for this data.
+    This function uses a set of heuristics to decide if the data is
+    continuous enough to warrant a histogram visualization.
+    """
+    # 1. Type Check: Histograms are only meaningful for numeric data.
+    if 'numeric' not in data_type and 'integer' not in data_type and 'float' not in data_type:
+        return False
+
+    # 2. Data Volume Check: Don't render if there's too little data or no variation.
+    if total_count < 20 or distinct_count <= 1:
+        return False
+
+    # 3. Categorical Data Filter: If the number of distinct values is very low,
+    # treat it as categorical data (e.g., ratings from 1-10, months 1-12).
+    if distinct_count < 15:
+        return False
+
+    # 4. High Cardinality Filter: If almost every value is unique (like an ID or index),
+    # a histogram is not useful as most bars would have a height of 1.
+    distinct_ratio = distinct_count / total_count
+    if distinct_ratio > 0.95:
+        return False
+
+    # 5. Pass: If the data passes all the above filters, it is considered
+    # sufficiently continuous to be visualized with a histogram.
+    return True
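A small smoke test for the two new helpers, using synthetic data (the values are rounded so they repeat; per the thresholds in should_show_histogram, near-unique or near-constant columns are skipped):

import random

from parqv.views.utils import create_text_histogram, should_show_histogram

data = [round(random.gauss(50, 10)) for _ in range(1000)]
if should_show_histogram("numeric", distinct_count=len(set(data)), total_count=len(data)):
    for line in create_text_histogram(data, bins=15, width=50, height=8, title="sample"):
        print(line)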
--- parqv-0.2.1/PKG-INFO
+++ parqv-0.3.0/src/parqv.egg-info/PKG-INFO
@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: parqv
-Version: 0.2.1
+Version: 0.3.0
 Summary: An interactive Python TUI for visualizing, exploring, and analyzing files directly in your terminal.
 Author-email: Sangmin Yoon <sanspareilsmyn@gmail.com>
 License-Expression: Apache-2.0
@@ -23,14 +23,13 @@ Dynamic: license-file
 
 ---
 
-**Supported File Formats:** ✅ **Parquet** | ✅ **JSON** / **JSON Lines (ndjson)** | *(More planned!)*
+**Supported File Formats:** ✅ **Parquet** | ✅ **JSON** / **JSON Lines (ndjson)** | ✅ **CSV / TSV** | *(More planned!)*
 
 ---
 
-**`parqv` is a Python-based interactive TUI (Text User Interface) tool designed to explore, analyze, and understand various data file formats directly within your terminal.**
+**`parqv` is a Python-based interactive TUI (Text User Interface) tool designed to explore, analyze, and understand various data file formats directly within your terminal.** `parqv` aims to provide a unified, visual experience for quick data inspection without leaving your console.
 
 ## 💻 Demo
-
 
 *(Demo shows Parquet features; UI adapts for other formats)*
 
@@ -47,7 +46,7 @@ Dynamic: license-file
 * **🔌 Extensible:** Designed with a handler interface to easily add support for more file formats in the future (like CSV, Arrow IPC, etc.).
 
 ## ✨ Features (TUI Mode)
-* **Multi-Format Support:**
+* **Multi-Format Support:** Now supports **Parquet** (`.parquet`), **JSON/JSON Lines** (`.json`, `.ndjson`), and **CSV/TSV** (`.csv`, `.tsv`). Run `parqv <your_file.{parquet,json,ndjson,csv,tsv}>`.
 * **Metadata Panel:** Displays key file information (path, format, size, total rows, column count, etc.). *Fields may vary slightly depending on the file format.*
 * **Schema Explorer:**
   * Interactive list view of columns.
|
@@ -21,6 +21,7 @@ src/parqv/data_sources/base/__init__.py
|
|
21
21
|
src/parqv/data_sources/base/exceptions.py
|
22
22
|
src/parqv/data_sources/base/handler.py
|
23
23
|
src/parqv/data_sources/formats/__init__.py
|
24
|
+
src/parqv/data_sources/formats/csv.py
|
24
25
|
src/parqv/data_sources/formats/json.py
|
25
26
|
src/parqv/data_sources/formats/parquet.py
|
26
27
|
src/parqv/views/__init__.py
|
@@ -34,4 +35,5 @@ src/parqv/views/components/error_display.py
|
|
34
35
|
src/parqv/views/components/loading_display.py
|
35
36
|
src/parqv/views/utils/__init__.py
|
36
37
|
src/parqv/views/utils/data_formatters.py
|
37
|
-
src/parqv/views/utils/stats_formatters.py
|
38
|
+
src/parqv/views/utils/stats_formatters.py
|
39
|
+
src/parqv/views/utils/visualization.py
|
File without changes
|
File without changes
|
File without changes
|
File without changes
|
File without changes
|
File without changes
|
File without changes
|
File without changes
|
File without changes
|
File without changes
|
File without changes
|
File without changes
|
File without changes
|
File without changes
|
File without changes
|
File without changes
|
File without changes
|
File without changes
|
File without changes
|
File without changes
|
File without changes
|
File without changes
|
File without changes
|
File without changes
|
File without changes
|