parqv 0.2.1__tar.gz → 0.3.0__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- {parqv-0.2.1/src/parqv.egg-info → parqv-0.3.0}/PKG-INFO +4 -5
- {parqv-0.2.1 → parqv-0.3.0}/README.md +3 -4
- {parqv-0.2.1 → parqv-0.3.0}/pyproject.toml +1 -1
- {parqv-0.2.1 → parqv-0.3.0}/src/parqv/core/config.py +2 -1
- {parqv-0.2.1 → parqv-0.3.0}/src/parqv/core/handler_factory.py +2 -1
- {parqv-0.2.1 → parqv-0.3.0}/src/parqv/data_sources/__init__.py +4 -0
- {parqv-0.2.1 → parqv-0.3.0}/src/parqv/data_sources/formats/__init__.py +5 -0
- parqv-0.3.0/src/parqv/data_sources/formats/csv.py +460 -0
- {parqv-0.2.1 → parqv-0.3.0}/src/parqv/data_sources/formats/json.py +37 -0
- {parqv-0.2.1 → parqv-0.3.0}/src/parqv/data_sources/formats/parquet.py +28 -1
- {parqv-0.2.1 → parqv-0.3.0}/src/parqv/views/utils/__init__.py +8 -2
- {parqv-0.2.1 → parqv-0.3.0}/src/parqv/views/utils/data_formatters.py +23 -1
- {parqv-0.2.1 → parqv-0.3.0}/src/parqv/views/utils/stats_formatters.py +78 -18
- parqv-0.3.0/src/parqv/views/utils/visualization.py +204 -0
- {parqv-0.2.1 → parqv-0.3.0/src/parqv.egg-info}/PKG-INFO +4 -5
- {parqv-0.2.1 → parqv-0.3.0}/src/parqv.egg-info/SOURCES.txt +3 -1
- {parqv-0.2.1 → parqv-0.3.0}/LICENSE +0 -0
- {parqv-0.2.1 → parqv-0.3.0}/setup.cfg +0 -0
- {parqv-0.2.1 → parqv-0.3.0}/src/parqv/__init__.py +0 -0
- {parqv-0.2.1 → parqv-0.3.0}/src/parqv/app.py +0 -0
- {parqv-0.2.1 → parqv-0.3.0}/src/parqv/cli.py +0 -0
- {parqv-0.2.1 → parqv-0.3.0}/src/parqv/core/__init__.py +0 -0
- {parqv-0.2.1 → parqv-0.3.0}/src/parqv/core/file_utils.py +0 -0
- {parqv-0.2.1 → parqv-0.3.0}/src/parqv/core/logging.py +0 -0
- {parqv-0.2.1 → parqv-0.3.0}/src/parqv/data_sources/base/__init__.py +0 -0
- {parqv-0.2.1 → parqv-0.3.0}/src/parqv/data_sources/base/exceptions.py +0 -0
- {parqv-0.2.1 → parqv-0.3.0}/src/parqv/data_sources/base/handler.py +0 -0
- {parqv-0.2.1 → parqv-0.3.0}/src/parqv/parqv.css +0 -0
- {parqv-0.2.1 → parqv-0.3.0}/src/parqv/views/__init__.py +0 -0
- {parqv-0.2.1 → parqv-0.3.0}/src/parqv/views/base.py +0 -0
- {parqv-0.2.1 → parqv-0.3.0}/src/parqv/views/components/__init__.py +0 -0
- {parqv-0.2.1 → parqv-0.3.0}/src/parqv/views/components/enhanced_data_table.py +0 -0
- {parqv-0.2.1 → parqv-0.3.0}/src/parqv/views/components/error_display.py +0 -0
- {parqv-0.2.1 → parqv-0.3.0}/src/parqv/views/components/loading_display.py +0 -0
- {parqv-0.2.1 → parqv-0.3.0}/src/parqv/views/data_view.py +0 -0
- {parqv-0.2.1 → parqv-0.3.0}/src/parqv/views/metadata_view.py +0 -0
- {parqv-0.2.1 → parqv-0.3.0}/src/parqv/views/schema_view.py +0 -0
- {parqv-0.2.1 → parqv-0.3.0}/src/parqv.egg-info/dependency_links.txt +0 -0
- {parqv-0.2.1 → parqv-0.3.0}/src/parqv.egg-info/entry_points.txt +0 -0
- {parqv-0.2.1 → parqv-0.3.0}/src/parqv.egg-info/requires.txt +0 -0
- {parqv-0.2.1 → parqv-0.3.0}/src/parqv.egg-info/top_level.txt +0 -0
--- parqv-0.2.1/src/parqv.egg-info/PKG-INFO
+++ parqv-0.3.0/PKG-INFO
@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: parqv
-Version: 0.2.1
+Version: 0.3.0
 Summary: An interactive Python TUI for visualizing, exploring, and analyzing files directly in your terminal.
 Author-email: Sangmin Yoon <sanspareilsmyn@gmail.com>
 License-Expression: Apache-2.0
@@ -23,14 +23,13 @@ Dynamic: license-file
 
 ---
 
-**Supported File Formats:** ✅ **Parquet** | ✅ **JSON** / **JSON Lines (ndjson)** | *(More planned!)*
+**Supported File Formats:** ✅ **Parquet** | ✅ **JSON** / **JSON Lines (ndjson)** | ✅ **CSV / TSV** | *(More planned!)*
 
 ---
 
-**`parqv` is a Python-based interactive TUI (Text User Interface) tool designed to explore, analyze, and understand various data file formats directly within your terminal.**
+**`parqv` is a Python-based interactive TUI (Text User Interface) tool designed to explore, analyze, and understand various data file formats directly within your terminal.** `parqv` aims to provide a unified, visual experience for quick data inspection without leaving your console.
 
 ## 💻 Demo
-
 
 *(Demo shows Parquet features; UI adapts for other formats)*
 
@@ -47,7 +46,7 @@ Dynamic: license-file
 * **🔌 Extensible:** Designed with a handler interface to easily add support for more file formats in the future (like CSV, Arrow IPC, etc.).
 
 ## ✨ Features (TUI Mode)
-* **Multi-Format Support:**
+* **Multi-Format Support:** Now supports **Parquet** (`.parquet`), **JSON/JSON Lines** (`.json`, `.ndjson`), and **CSV/TSV** (`.csv`, `.tsv`). Run `parqv <your_file.{parquet,json,ndjson,csv,tsv}>`.
 * **Metadata Panel:** Displays key file information (path, format, size, total rows, column count, etc.). *Fields may vary slightly depending on the file format.*
 * **Schema Explorer:**
   * Interactive list view of columns.
--- parqv-0.2.1/README.md
+++ parqv-0.3.0/README.md
@@ -7,14 +7,13 @@
 
 ---
 
-**Supported File Formats:** ✅ **Parquet** | ✅ **JSON** / **JSON Lines (ndjson)** | *(More planned!)*
+**Supported File Formats:** ✅ **Parquet** | ✅ **JSON** / **JSON Lines (ndjson)** | ✅ **CSV / TSV** | *(More planned!)*
 
 ---
 
-**`parqv` is a Python-based interactive TUI (Text User Interface) tool designed to explore, analyze, and understand various data file formats directly within your terminal.**
+**`parqv` is a Python-based interactive TUI (Text User Interface) tool designed to explore, analyze, and understand various data file formats directly within your terminal.** `parqv` aims to provide a unified, visual experience for quick data inspection without leaving your console.
 
 ## 💻 Demo
-
 
 *(Demo shows Parquet features; UI adapts for other formats)*
 
@@ -31,7 +30,7 @@
 * **🔌 Extensible:** Designed with a handler interface to easily add support for more file formats in the future (like CSV, Arrow IPC, etc.).
 
 ## ✨ Features (TUI Mode)
-* **Multi-Format Support:**
+* **Multi-Format Support:** Now supports **Parquet** (`.parquet`), **JSON/JSON Lines** (`.json`, `.ndjson`), and **CSV/TSV** (`.csv`, `.tsv`). Run `parqv <your_file.{parquet,json,ndjson,csv,tsv}>`.
 * **Metadata Panel:** Displays key file information (path, format, size, total rows, column count, etc.). *Fields may vary slightly depending on the file format.*
 * **Schema Explorer:**
   * Interactive list view of columns.
--- parqv-0.2.1/pyproject.toml
+++ parqv-0.3.0/pyproject.toml
@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"
 
 [project]
 name = "parqv"
-version = "0.2.1"
+version = "0.3.0"
 description = "An interactive Python TUI for visualizing, exploring, and analyzing files directly in your terminal."
 readme = "README.md"
 requires-python = ">=3.10"
--- parqv-0.2.1/src/parqv/core/handler_factory.py
+++ parqv-0.3.0/src/parqv/core/handler_factory.py
@@ -5,7 +5,7 @@ Handler factory for creating appropriate data handlers based on file type.
 from pathlib import Path
 from typing import Optional
 
-from ..data_sources import DataHandler, DataHandlerError, ParquetHandler, JsonHandler
+from ..data_sources import DataHandler, DataHandlerError, ParquetHandler, JsonHandler, CsvHandler
 from .logging import get_logger
 
 log = get_logger(__name__)
@@ -23,6 +23,7 @@ class HandlerFactory:
     _HANDLER_REGISTRY = {
         "parquet": ParquetHandler,
         "json": JsonHandler,
+        "csv": CsvHandler,
     }
 
     @classmethod
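For context, the registry above maps a normalized format key to a handler class, which is why CSV support is a one-line registration. A minimal sketch of how registry-based dispatch can work — the `resolve_handler` helper and the extension-to-format table below are hypothetical illustrations, not parqv's actual factory method; the tsv→csv mapping is an assumption based on the README's CSV/TSV claim:

from pathlib import Path

from parqv.core.handler_factory import HandlerFactory
from parqv.data_sources import DataHandler, DataHandlerError

# Assumed extension mapping; parqv's real detection logic is not shown in this diff.
_EXTENSION_TO_FORMAT = {"parquet": "parquet", "json": "json", "ndjson": "json",
                        "csv": "csv", "tsv": "csv"}

def resolve_handler(file_path: Path) -> DataHandler:
    # Normalize the extension, look up the format key, then instantiate the class.
    format_key = _EXTENSION_TO_FORMAT.get(file_path.suffix.lstrip(".").lower())
    handler_cls = HandlerFactory._HANDLER_REGISTRY.get(format_key)
    if handler_cls is None:
        raise DataHandlerError(f"Unsupported file format: {file_path.suffix}")
    return handler_cls(file_path)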
--- parqv-0.2.1/src/parqv/data_sources/__init__.py
+++ parqv-0.3.0/src/parqv/data_sources/__init__.py
@@ -23,6 +23,8 @@ from .formats import (
     ParquetHandlerError,
     JsonHandler,
     JsonHandlerError,
+    CsvHandler,
+    CsvHandlerError,
 )
 
 __all__ = [
@@ -41,4 +43,6 @@ __all__ = [
     "ParquetHandlerError",
     "JsonHandler",
     "JsonHandlerError",
+    "CsvHandler",
+    "CsvHandlerError",
 ]
--- parqv-0.2.1/src/parqv/data_sources/formats/__init__.py
+++ parqv-0.3.0/src/parqv/data_sources/formats/__init__.py
@@ -4,6 +4,7 @@ Format-specific data handlers for parqv.
 
 from .parquet import ParquetHandler, ParquetHandlerError
 from .json import JsonHandler, JsonHandlerError
+from .csv import CsvHandler, CsvHandlerError
 
 __all__ = [
     # Parquet format
@@ -13,4 +14,8 @@ __all__ = [
     # JSON format
     "JsonHandler",
     "JsonHandlerError",
+
+    # CSV format
+    "CsvHandler",
+    "CsvHandlerError",
 ]
--- /dev/null
+++ parqv-0.3.0/src/parqv/data_sources/formats/csv.py
@@ -0,0 +1,460 @@
+"""
+CSV file handler for parqv data sources.
+"""
+
+from pathlib import Path
+from typing import Any, Dict, List, Optional
+
+import pandas as pd
+
+from ..base import DataHandler, DataHandlerError
+
+
+class CsvHandlerError(DataHandlerError):
+    """Custom exception for CSV handling errors."""
+    pass
+
+
+class CsvHandler(DataHandler):
+    """
+    Handles CSV file interactions using pandas.
+
+    Provides methods to access metadata, schema, data preview, and column statistics
+    for CSV files using pandas DataFrame operations.
+    """
+
+    def __init__(self, file_path: Path):
+        """
+        Initialize the CsvHandler by validating the path and reading the CSV file.
+
+        Args:
+            file_path: Path to the CSV file.
+
+        Raises:
+            CsvHandlerError: If the file is not found, not a file, or cannot be read.
+        """
+        super().__init__(file_path)
+        self.df: Optional[pd.DataFrame] = None
+        self._original_dtypes: Optional[Dict[str, str]] = None
+
+        try:
+            # Validate file existence
+            if not self.file_path.is_file():
+                raise FileNotFoundError(f"CSV file not found or is not a regular file: {self.file_path}")
+
+            # Read the CSV file with pandas
+            self._read_csv_file()
+
+            self.logger.info(f"Successfully initialized CsvHandler for: {self.file_path.name}")
+
+        except FileNotFoundError as fnf_e:
+            self.logger.error(f"File not found during CsvHandler initialization: {fnf_e}")
+            raise CsvHandlerError(str(fnf_e)) from fnf_e
+        except pd.errors.EmptyDataError as empty_e:
+            self.logger.error(f"CSV file is empty: {empty_e}")
+            raise CsvHandlerError(f"CSV file '{self.file_path.name}' is empty") from empty_e
+        except pd.errors.ParserError as parse_e:
+            self.logger.error(f"CSV parsing error: {parse_e}")
+            raise CsvHandlerError(f"Failed to parse CSV file '{self.file_path.name}': {parse_e}") from parse_e
+        except Exception as e:
+            self.logger.exception(f"Unexpected error initializing CsvHandler for {self.file_path.name}")
+            raise CsvHandlerError(f"Failed to initialize CSV handler '{self.file_path.name}': {e}") from e
+
+    def _read_csv_file(self) -> None:
+        """Read the CSV file using pandas with appropriate settings."""
+        try:
+            # Read CSV with automatic type inference
+            self.df = pd.read_csv(
+                self.file_path,
+                # Basic settings
+                encoding='utf-8',
+                # Handle various separators automatically
+                sep=None,  # Let pandas auto-detect
+                engine='python',  # More flexible parsing
+                # Preserve original string representation for better type info
+                dtype=str,  # Read everything as string first
+                na_values=['', 'NULL', 'null', 'None', 'N/A', 'n/a', 'NaN', 'nan'],
+                keep_default_na=True,
+            )
+
+            # Store original dtypes before conversion
+            self._original_dtypes = {col: 'string' for col in self.df.columns}
+
+            # Try to infer better types
+            self._infer_types()
+
+            self.logger.debug(f"Successfully read CSV with shape: {self.df.shape}")
+
+        except UnicodeDecodeError:
+            # Try with different encodings
+            for encoding in ['latin1', 'cp1252', 'iso-8859-1']:
+                try:
+                    self.logger.warning(f"Trying encoding: {encoding}")
+                    self.df = pd.read_csv(
+                        self.file_path,
+                        encoding=encoding,
+                        sep=None,
+                        engine='python',
+                        dtype=str,
+                        na_values=['', 'NULL', 'null', 'None', 'N/A', 'n/a', 'NaN', 'nan'],
+                        keep_default_na=True,
+                    )
+                    self._original_dtypes = {col: 'string' for col in self.df.columns}
+                    self._infer_types()
+                    self.logger.info(f"Successfully read CSV with encoding: {encoding}")
+                    break
+                except UnicodeDecodeError:
+                    continue
+            else:
+                raise CsvHandlerError(f"Could not decode CSV file with any common encoding")
+
+    def _infer_types(self) -> None:
+        """Infer appropriate data types for columns."""
+        if self.df is None:
+            return
+
+        for col in self.df.columns:
+            # Try to convert to numeric
+            numeric_converted = pd.to_numeric(self.df[col], errors='coerce')
+            if not numeric_converted.isna().all():
+                # If most values can be converted to numeric, use numeric type
+                non_na_original = self.df[col].notna().sum()
+                non_na_converted = numeric_converted.notna().sum()
+
+                if non_na_converted / max(non_na_original, 1) > 0.8:  # 80% conversion success
+                    self.df[col] = numeric_converted
+                    if (numeric_converted == numeric_converted.astype('Int64', errors='ignore')).all():
+                        self._original_dtypes[col] = 'integer'
+                    else:
+                        self._original_dtypes[col] = 'float'
+                    continue
+
+            # Try to convert to datetime
+            try:
+                datetime_converted = pd.to_datetime(self.df[col], errors='coerce', infer_datetime_format=True)
+                if not datetime_converted.isna().all():
+                    non_na_original = self.df[col].notna().sum()
+                    non_na_converted = datetime_converted.notna().sum()
+
+                    if non_na_converted / max(non_na_original, 1) > 0.8:  # 80% conversion success
+                        self.df[col] = datetime_converted
+                        self._original_dtypes[col] = 'datetime'
+                        continue
+            except (ValueError, TypeError):
+                pass
+
+            # Try to convert to boolean
+            bool_values = self.df[col].str.lower().isin(['true', 'false', 't', 'f', '1', '0', 'yes', 'no', 'y', 'n'])
+            if bool_values.sum() / len(self.df[col]) > 0.8:
+                bool_mapping = {
+                    'true': True, 'false': False, 't': True, 'f': False,
+                    '1': True, '0': False, 'yes': True, 'no': False,
+                    'y': True, 'n': False
+                }
+                self.df[col] = self.df[col].str.lower().map(bool_mapping)
+                self._original_dtypes[col] = 'boolean'
+                continue
+
+            # Keep as string
+            self._original_dtypes[col] = 'string'
+
+    def close(self) -> None:
+        """Close and cleanup resources (CSV data is held in memory)."""
+        if self.df is not None:
+            self.logger.info(f"Closed CSV handler for: {self.file_path.name}")
+        self.df = None
+        self._original_dtypes = None
+
+    def get_metadata_summary(self) -> Dict[str, Any]:
+        """
+        Get a summary dictionary of the CSV file's metadata.
+
+        Returns:
+            A dictionary containing metadata like file path, format, row count, columns, size.
+        """
+        if self.df is None:
+            return {"error": "CSV data not loaded or handler closed."}
+
+        try:
+            file_size = self.file_path.stat().st_size
+            size_str = self.format_size(file_size)
+        except Exception as e:
+            self.logger.warning(f"Could not get file size for {self.file_path}: {e}")
+            size_str = "N/A"
+
+        # Create a well-structured metadata summary
+        summary = {
+            "File Information": {
+                "Path": str(self.file_path),
+                "Format": "CSV",
+                "Size": size_str
+            },
+            "Data Structure": {
+                "Total Rows": f"{len(self.df):,}",
+                "Total Columns": f"{len(self.df.columns):,}",
+                "Memory Usage": f"{self.df.memory_usage(deep=True).sum():,} bytes"
+            },
+            "Column Types Summary": self._get_column_types_summary()
+        }
+
+        return summary
+
+    def _get_column_types_summary(self) -> Dict[str, int]:
+        """Get a summary of column types in the CSV data."""
+        if self.df is None or self._original_dtypes is None:
+            return {}
+
+        type_counts = {}
+        for col_type in self._original_dtypes.values():
+            type_counts[col_type] = type_counts.get(col_type, 0) + 1
+
+        # Format for better display
+        formatted_summary = {}
+        type_labels = {
+            'string': 'Text Columns',
+            'integer': 'Integer Columns',
+            'float': 'Numeric Columns',
+            'datetime': 'Date/Time Columns',
+            'boolean': 'Boolean Columns'
+        }
+
+        for type_key, count in type_counts.items():
+            label = type_labels.get(type_key, f'{type_key.title()} Columns')
+            formatted_summary[label] = f"{count:,}"
+
+        return formatted_summary
+
+    def get_schema_data(self) -> Optional[List[Dict[str, Any]]]:
+        """
+        Get the schema of the CSV data.
+
+        Returns:
+            A list of dictionaries describing columns (name, type, nullable),
+            or None if schema couldn't be determined.
+        """
+        if self.df is None:
+            self.logger.warning("DataFrame is not available for schema data")
+            return None
+
+        schema_list = []
+
+        for col in self.df.columns:
+            try:
+                # Get the inferred type
+                col_type = self._original_dtypes.get(col, 'string')
+
+                # Check for null values
+                has_nulls = self.df[col].isna().any()
+
+                schema_list.append({
+                    "name": str(col),
+                    "type": col_type,
+                    "nullable": bool(has_nulls)
+                })
+
+            except Exception as e:
+                self.logger.error(f"Error processing column '{col}' for schema data: {e}")
+                schema_list.append({
+                    "name": str(col),
+                    "type": f"[Error: {e}]",
+                    "nullable": None
+                })
+
+        return schema_list
+
+    def get_data_preview(self, num_rows: int = 50) -> Optional[pd.DataFrame]:
+        """
+        Fetch a preview of the data.
+
+        Args:
+            num_rows: The maximum number of rows to fetch.
+
+        Returns:
+            A pandas DataFrame with preview data, an empty DataFrame if no data,
+            or a DataFrame with an 'error' column on failure.
+        """
+        if self.df is None:
+            self.logger.warning("CSV data not available for preview")
+            return pd.DataFrame({"error": ["CSV data not loaded or handler closed."]})
+
+        try:
+            if self.df.empty:
+                self.logger.info("CSV file has no data rows")
+                return pd.DataFrame(columns=self.df.columns)
+
+            # Return first num_rows
+            preview_df = self.df.head(num_rows).copy()
+            self.logger.info(f"Generated preview of {len(preview_df)} rows for {self.file_path.name}")
+            return preview_df
+
+        except Exception as e:
+            self.logger.exception(f"Error generating data preview from CSV file: {self.file_path.name}")
+            return pd.DataFrame({"error": [f"Failed to generate preview: {e}"]})
+
+    def get_column_stats(self, column_name: str) -> Dict[str, Any]:
+        """
+        Calculate and return statistics for a specific column.
+
+        Args:
+            column_name: The name of the column.
+
+        Returns:
+            A dictionary containing column statistics or error information.
+        """
+        if self.df is None:
+            return self._create_stats_result(
+                column_name, "Unknown", {}, error="CSV data not loaded or handler closed."
+            )
+
+        if column_name not in self.df.columns:
+            return self._create_stats_result(
+                column_name, "Unknown", {}, error=f"Column '{column_name}' not found in CSV data."
+            )
+
+        try:
+            col_series = self.df[column_name]
+            col_type = self._original_dtypes.get(column_name, 'string')
+
+            # Basic counts
+            total_count = len(col_series)
+            null_count = col_series.isna().sum()
+            valid_count = total_count - null_count
+            null_percentage = (null_count / total_count * 100) if total_count > 0 else 0
+
+            stats = {
+                "Total Count": f"{total_count:,}",
+                "Valid Count": f"{valid_count:,}",
+                "Null Count": f"{null_count:,}",
+                "Null Percentage": f"{null_percentage:.2f}%"
+            }
+
+            # Type-specific statistics
+            if valid_count > 0:
+                valid_series = col_series.dropna()
+
+                # Distinct count (always applicable)
+                distinct_count = valid_series.nunique()
+                stats["Distinct Count"] = f"{distinct_count:,}"
+
+                if col_type in ['integer', 'float']:
+                    # Numeric statistics
+                    stats.update(self._calculate_numeric_stats_pandas(valid_series))
+                elif col_type == 'datetime':
+                    # Datetime statistics
+                    stats.update(self._calculate_datetime_stats_pandas(valid_series))
+                elif col_type == 'boolean':
+                    # Boolean statistics
+                    stats.update(self._calculate_boolean_stats_pandas(valid_series))
+                elif col_type == 'string':
+                    # String statistics (min/max by alphabetical order)
+                    stats.update(self._calculate_string_stats_pandas(valid_series))
+
+            return self._create_stats_result(column_name, col_type, stats, nullable=null_count > 0)
+
+        except Exception as e:
+            self.logger.exception(f"Error calculating stats for column '{column_name}'")
+            return self._create_stats_result(
+                column_name, "Unknown", {}, error=f"Failed to calculate statistics: {e}"
+            )
+
+    def _calculate_numeric_stats_pandas(self, series: pd.Series) -> Dict[str, Any]:
+        """Calculate statistics for numeric columns using pandas."""
+        stats = {}
+        try:
+            stats["Min"] = series.min()
+            stats["Max"] = series.max()
+            stats["Mean"] = f"{series.mean():.4f}"
+            stats["Median (50%)"] = series.median()
+            stats["StdDev"] = f"{series.std():.4f}"
+
+            # Add histogram data for visualization
+            try:
+                # Sample data if too large for performance
+                sample_size = min(10000, len(series))
+                if len(series) > sample_size:
+                    sampled_series = series.sample(n=sample_size, random_state=42)
+                else:
+                    sampled_series = series
+
+                # Convert to list for histogram
+                clean_data = sampled_series.tolist()
+
+                if len(clean_data) > 10:  # Only create histogram if we have enough data
+                    stats["_histogram_data"] = clean_data
+                    stats["_data_type"] = "numeric"
+
+            except Exception as e:
+                self.logger.warning(f"Failed to prepare histogram data: {e}")
+
+        except Exception as e:
+            self.logger.warning(f"Error calculating numeric stats: {e}")
+            stats["Calculation Error"] = str(e)
+        return stats
+
+    def _calculate_datetime_stats_pandas(self, series: pd.Series) -> Dict[str, Any]:
+        """Calculate statistics for datetime columns using pandas."""
+        stats = {}
+        try:
+            stats["Min"] = series.min()
+            stats["Max"] = series.max()
+            # Calculate time range
+            time_range = series.max() - series.min()
+            stats["Range"] = str(time_range)
+        except Exception as e:
+            self.logger.warning(f"Error calculating datetime stats: {e}")
+            stats["Calculation Error"] = str(e)
+        return stats
+
+    def _calculate_boolean_stats_pandas(self, series: pd.Series) -> Dict[str, Any]:
+        """Calculate statistics for boolean columns using pandas."""
+        stats = {}
+        try:
+            value_counts = series.value_counts()
+            stats["True Count"] = f"{value_counts.get(True, 0):,}"
+            stats["False Count"] = f"{value_counts.get(False, 0):,}"
+            if len(value_counts) > 0:
+                true_pct = (value_counts.get(True, 0) / len(series) * 100)
+                stats["True Percentage"] = f"{true_pct:.2f}%"
+        except Exception as e:
+            self.logger.warning(f"Error calculating boolean stats: {e}")
+            stats["Calculation Error"] = str(e)
+        return stats
+
+    def _calculate_string_stats_pandas(self, series: pd.Series) -> Dict[str, Any]:
+        """Calculate statistics for string columns using pandas."""
+        stats = {}
+        try:
+            # Only min/max for strings (alphabetical order)
+            stats["Min"] = str(series.min())
+            stats["Max"] = str(series.max())
+
+            # Most common values
+            value_counts = series.value_counts().head(5)
+            if len(value_counts) > 0:
+                top_values = {}
+                for value, count in value_counts.items():
+                    top_values[str(value)] = f"{count:,}"
+                stats["Top Values"] = top_values
+        except Exception as e:
+            self.logger.warning(f"Error calculating string stats: {e}")
+            stats["Calculation Error"] = str(e)
+        return stats
+
+    def _create_stats_result(
+            self,
+            column_name: str,
+            col_type: str,
+            calculated_stats: Dict[str, Any],
+            nullable: Optional[bool] = None,
+            error: Optional[str] = None,
+            message: Optional[str] = None
+    ) -> Dict[str, Any]:
+        """Package the stats results consistently."""
+        return {
+            "column": column_name,
+            "type": col_type,
+            "nullable": nullable if nullable is not None else "Unknown",
+            "calculated": calculated_stats or {},
+            "error": error,
+            "message": message,
+        }
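A quick sketch of what the new handler's surface looks like when driven directly — inside parqv these calls are made by the TUI through HandlerFactory, and "events.csv" and "user_id" below are made-up example names:

from pathlib import Path

from parqv.data_sources import CsvHandler

handler = CsvHandler(Path("events.csv"))       # hypothetical input file
print(handler.get_metadata_summary())          # grouped File Information / Data Structure / type summary
print(handler.get_schema_data())               # [{"name": ..., "type": ..., "nullable": ...}, ...]
print(handler.get_data_preview(num_rows=10))   # pandas DataFrame with the first rows
print(handler.get_column_stats("user_id"))     # counts, null percentage, type-specific stats
handler.close()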
--- parqv-0.2.1/src/parqv/data_sources/formats/json.py
+++ parqv-0.3.0/src/parqv/data_sources/formats/json.py
@@ -303,6 +303,12 @@ class JsonHandler(DataHandler):
             # SUMMARIZE puts results in the first row
             stats = self._format_summarize_stats(summarize_df.iloc[0])
 
+            # Add histogram data for numeric columns
+            try:
+                self._add_histogram_data_if_numeric(stats, safe_column_name)
+            except Exception as hist_e:
+                self.logger.warning(f"Failed to add histogram data for {column_name}: {hist_e}")
+
         except duckdb.Error as db_err:
             self.logger.exception(f"DuckDB Error calculating statistics for column '{column_name}': {db_err}")
             error_msg = f"DuckDB calculation failed: {db_err}"
@@ -403,6 +409,37 @@ class JsonHandler(DataHandler):
 
         return stats
 
+    def _add_histogram_data_if_numeric(self, stats: Dict[str, Any], safe_column_name: str) -> None:
+        """Add histogram data for numeric columns by sampling from DuckDB."""
+        # Check if this looks like numeric data (has Mean, Min, Max)
+        if not all(key in stats for key in ["Mean", "Min", "Max"]):
+            return
+
+        try:
+            # Sample data for histogram (limit to 10k samples for performance)
+            sample_query = f"""
+                SELECT {safe_column_name}
+                FROM "{self._view_name}"
+                WHERE {safe_column_name} IS NOT NULL
+                USING SAMPLE 10000
+            """
+
+            sample_df = self._db_conn.sql(sample_query).df()
+
+            if not sample_df.empty and len(sample_df) > 10:
+                # Extract the column data
+                column_data = sample_df.iloc[:, 0].tolist()
+
+                # Filter out any remaining nulls
+                clean_data = [val for val in column_data if val is not None]
+
+                if len(clean_data) > 10:
+                    stats["_histogram_data"] = clean_data
+                    stats["_data_type"] = "numeric"
+
+        except Exception as e:
+            self.logger.warning(f"Failed to sample data for histogram: {e}")
+
     def _create_stats_result(
         self,
         column_name: str,
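The `USING SAMPLE 10000` clause in the query above is DuckDB's built-in sampling syntax, which bounds how much data leaves the database for the histogram regardless of source size. A standalone sketch of the same idea, using a synthetic table rather than parqv's JSON view:

import duckdb  # assumes the duckdb package is installed

con = duckdb.connect()
# Sampling happens inside DuckDB, so only ~10k rows cross into pandas.
df = con.sql("SELECT * FROM range(1000000) USING SAMPLE 10000").df()
print(len(df))  # 10000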
--- parqv-0.2.1/src/parqv/data_sources/formats/parquet.py
+++ parqv-0.3.0/src/parqv/data_sources/formats/parquet.py
@@ -301,7 +301,8 @@ class ParquetHandler(DataHandler):
                 message = f"Statistics calculation not implemented for type '{self._format_pyarrow_type(col_type)}'."
 
         except Exception as calc_err:
-            self.logger.exception(f"Error during type-specific calculation for column '{column_name}': {calc_err}")
+            self.logger.exception(
+                f"Error during type-specific calculation for column '{column_name}': {calc_err}")
             error_msg = f"Calculation error for type {field.type}: {calc_err}"
             calculated_stats["Calculation Error"] = str(calc_err)  # Add specific error key
 
@@ -422,6 +423,32 @@ class ParquetHandler(DataHandler):
         variance_val, err_var = self._safe_compute(pc.variance, column_data, ddof=1)
         stats["Variance"] = f"{variance_val:.4f}" if variance_val is not None and err_var is None else (
             err_var or "N/A")
+        distinct_val, err = self._safe_compute(pc.count_distinct, column_data)
+        stats["Distinct Count"] = f"{distinct_val:,}" if distinct_val is not None and err is None else (err or "N/A")
+
+        # Add histogram data for visualization
+        try:
+            # Convert to Python list for histogram calculation (sample if too large)
+            data_length = len(column_data)
+            sample_size = min(10000, data_length)  # Limit to 10k samples for performance
+
+            if data_length > sample_size:
+                # Sample the data
+                import random
+                indices = sorted(random.sample(range(data_length), sample_size))
+                sampled_data = [column_data[i].as_py() for i in indices]
+            else:
+                sampled_data = column_data.to_pylist()
+
+            # Filter out None values
+            clean_data = [val for val in sampled_data if val is not None]
+
+            if len(clean_data) > 10:  # Only create histogram if we have enough data
+                stats["_histogram_data"] = clean_data
+                stats["_data_type"] = "numeric"
+
+        except Exception as e:
+            self.logger.warning(f"Failed to prepare histogram data: {e}")
 
         return stats
 
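The sampling strategy here indexes into the pyarrow array rather than converting it wholesale, so only the sampled values are materialized in Python via `.as_py()`. A standalone sketch of that approach on a synthetic array:

import random
import pyarrow as pa

arr = pa.chunked_array([list(range(100_000))])
indices = sorted(random.sample(range(len(arr)), 10_000))
sampled = [arr[i].as_py() for i in indices]  # ChunkedArray supports scalar indexing
print(len(sampled))  # 10000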
--- parqv-0.2.1/src/parqv/views/utils/__init__.py
+++ parqv-0.3.0/src/parqv/views/utils/__init__.py
@@ -4,10 +4,16 @@ Utility functions for parqv views.
 
 from .data_formatters import format_metadata_for_display, format_value_for_display
 from .stats_formatters import format_stats_for_display, format_column_info
+from .visualization import create_text_histogram, should_show_histogram
 
 __all__ = [
+    # Data formatting
     "format_metadata_for_display",
-    "format_value_for_display",
+    "format_value_for_display",
     "format_stats_for_display",
     "format_column_info",
-]
+
+    # Visualization
+    "create_text_histogram",
+    "should_show_histogram",
+]
--- parqv-0.2.1/src/parqv/views/utils/data_formatters.py
+++ parqv-0.3.0/src/parqv/views/utils/data_formatters.py
@@ -28,15 +28,21 @@ def format_metadata_for_display(metadata: Dict[str, Any]) -> Dict[str, Any]:
     # Format specific known fields with better presentation
     field_formatters = {
         "File Path": lambda x: str(x),
+        "Path": lambda x: str(x),
         "Format": lambda x: str(x).upper(),
         "Total Rows": lambda x: _format_number(x),
+        "Total Columns": lambda x: _format_number(x),
         "Columns": lambda x: _format_number(x),
         "Size": lambda x: _format_size_if_bytes(x),
+        "Memory Usage": lambda x: _format_size_if_bytes(x),
         "DuckDB View": lambda x: f"`{x}`" if x else "N/A",
     }
 
     for key, value in metadata.items():
-        if key in field_formatters:
+        if isinstance(value, dict):
+            # Handle nested dictionaries (like grouped metadata)
+            formatted[key] = _format_nested_metadata(value, field_formatters)
+        elif key in field_formatters:
             formatted[key] = field_formatters[key](value)
         else:
             formatted[key] = format_value_for_display(value)
@@ -44,6 +50,22 @@ def format_metadata_for_display(metadata: Dict[str, Any]) -> Dict[str, Any]:
     return formatted
 
 
+def _format_nested_metadata(nested_dict: Dict[str, Any], field_formatters: Dict) -> Dict[str, Any]:
+    """Format nested metadata dictionaries."""
+    formatted_nested = {}
+
+    for key, value in nested_dict.items():
+        if isinstance(value, dict):
+            # Handle further nesting if needed
+            formatted_nested[key] = _format_nested_metadata(value, field_formatters)
+        elif key in field_formatters:
+            formatted_nested[key] = field_formatters[key](value)
+        else:
+            formatted_nested[key] = format_value_for_display(value)
+
+    return formatted_nested
+
+
 def format_value_for_display(value: Any) -> str:
     """
     Format a single value for display in the UI.
--- parqv-0.2.1/src/parqv/views/utils/stats_formatters.py
+++ parqv-0.3.0/src/parqv/views/utils/stats_formatters.py
@@ -6,6 +6,8 @@ from typing import Any, Dict, List, Union
 
 from rich.text import Text
 
+from .visualization import create_text_histogram, should_show_histogram
+
 
 def format_stats_for_display(stats_data: Dict[str, Any]) -> List[Union[str, Text]]:
     """
@@ -21,7 +23,7 @@ def format_stats_for_display(stats_data: Dict[str, Any]) -> List[Union[str, Text
         return [Text.from_markup("[red]No statistics data available.[/red]")]
 
     lines: List[Union[str, Text]] = []
-
+
     # Extract basic column information
     col_name = stats_data.get("column", "N/A")
     col_type = stats_data.get("type", "Unknown")
@@ -29,22 +31,22 @@ def format_stats_for_display(stats_data: Dict[str, Any]) -> List[Union[str, Text
 
     # Format column header
     lines.extend(_format_column_header(col_name, col_type, nullable_val))
-
+
     # Handle calculation errors
     calc_error = stats_data.get("error")
     if calc_error:
         lines.extend(_format_error_section(calc_error))
-
+
     # Add informational messages
     message = stats_data.get("message")
     if message:
         lines.extend(_format_message_section(message))
-
+
     # Format calculated statistics
     calculated = stats_data.get("calculated")
     if calculated:
         lines.extend(_format_calculated_stats(calculated, has_error=bool(calc_error)))
-
+
     return lines
 
 
@@ -72,13 +74,13 @@ def _format_column_header(col_name: str, col_type: str, nullable_val: Any) -> Li
         nullable_str = "Required"
     else:
         nullable_str = "Unknown Nullability"
-
+
     lines = [
         Text.assemble(("Column: ", "bold"), f"`{col_name}`"),
         Text.assemble(("Type: ", "bold"), f"{col_type} ({nullable_str})"),
         "─" * (len(col_name) + len(col_type) + 20)
    ]
-
+
     return lines
 
 
@@ -102,7 +104,7 @@ def _format_message_section(message: str) -> List[Union[str, Text]]:
 def _format_calculated_stats(calculated: Dict[str, Any], has_error: bool = False) -> List[Union[str, Text]]:
     """Format the calculated statistics section."""
     lines = [Text("Calculated Statistics:", style="bold")]
-
+
     # Define the order of statistics to display
     stats_order = [
         "Total Count", "Valid Count", "Null Count", "Null Percentage",
@@ -111,32 +113,37 @@ def _format_calculated_stats(calculated: Dict[str, Any], has_error: bool = False
         "True Count", "False Count",
         "Value Counts"
     ]
-
+
     found_stats = False
-
+
     for key in stats_order:
         if key in calculated:
             found_stats = True
             value = calculated[key]
             lines.extend(_format_single_stat(key, value))
-
-    # Add any additional stats not in the predefined order
+
+    # Add any additional stats not in the predefined order (excluding internal histogram data)
     for key, value in calculated.items():
-        if key not in stats_order:
+        if key not in stats_order and not key.startswith('_'):  # Skip internal fields
             found_stats = True
             lines.extend(_format_single_stat(key, value))
-
+
     # Handle case where no stats were found
     if not found_stats and not has_error:
         lines.append(Text(" (No specific stats calculated for this type)", style="dim"))
-
+
+    # Add histogram visualization for numeric data
+    if "_histogram_data" in calculated and "_data_type" in calculated:
+        if calculated["_data_type"] == "numeric":
+            lines.extend(_format_histogram_visualization(calculated))
+
     return lines
 
 
 def _format_single_stat(key: str, value: Any) -> List[Union[str, Text]]:
     """Format a single statistic entry."""
     lines = []
-
+
     if key == "Value Counts" and isinstance(value, dict):
         lines.append(f"  - {key}:")
         for sub_key, sub_val in value.items():
@@ -145,7 +152,7 @@ def _format_single_stat(key: str, value: Any) -> List[Union[str, Text]]:
     else:
         formatted_value = _format_stat_value(value)
         lines.append(f"  - {key}: {formatted_value}")
-
+
     return lines
 
 
@@ -157,4 +164,57 @@ def _format_stat_value(value: Any) -> str:
         else:
             return f"{value:,.4f}"
     else:
-        return str(value)
+        return str(value)
+
+
+def _format_histogram_visualization(calculated: Dict[str, Any]) -> List[Union[str, Text]]:
+    """Format histogram visualization for numeric data."""
+    lines = []
+
+    try:
+        histogram_data = calculated.get("_histogram_data", [])
+        if not histogram_data:
+            return lines
+
+        # Check if we should show histogram
+        distinct_count_str = calculated.get("Distinct Count", "0")
+        try:
+            # Remove commas and convert to int
+            distinct_count = int(distinct_count_str.replace(",", ""))
+        except (ValueError, AttributeError):
+            distinct_count = len(set(histogram_data))
+
+        total_count = len(histogram_data)
+
+        if should_show_histogram("numeric", distinct_count, total_count):
+            lines.append("")
+            lines.append(Text("Data Distribution:", style="bold cyan"))
+
+            # Create histogram
+            histogram_lines = create_text_histogram(
+                data=histogram_data,
+                bins=15,
+                width=50,
+                height=8,
+                title=None
+            )
+
+            # Add each histogram line
+            for line in histogram_lines:
+                if isinstance(line, str):
+                    lines.append(f"  {line}")
+                else:
+                    lines.append(line)
+        else:
+            # For discrete data, show a note
+            if distinct_count < total_count * 0.1:  # Less than 10% unique values
+                lines.append("")
+                lines.append(Text("Note: Data appears to be discrete/categorical", style="dim italic"))
+                lines.append(Text("(Histogram not shown for discrete values)", style="dim italic"))
+
+    except Exception as e:
+        # Don't fail the whole stats display if histogram fails
+        lines.append("")
+        lines.append(Text(f"Note: Could not generate histogram: {e}", style="dim red"))
+
+    return lines
--- /dev/null
+++ parqv-0.3.0/src/parqv/views/utils/visualization.py
@@ -0,0 +1,204 @@
+"""
+Visualization utilities for parqv views.
+
+Provides text-based data visualization functions like ASCII histograms.
+"""
+import math
+from typing import List, Union, Optional
+
+TICK_CHARS = [' ', '▂', '▃', '▄', '▅', '▆', '▇', '█']
+
+
+def create_text_histogram(
+    data: List[Union[int, float]],
+    bins: int = 15,
+    width: int = 60,
+    height: int = 8,
+    title: Optional[str] = None
+) -> List[str]:
+    """
+    Create a professional, text-based histogram from numerical data.
+
+    Args:
+        data: List of numerical values.
+        bins: The number of bins for the histogram.
+        width: The total character width of the output histogram.
+        height: The maximum height of the histogram bars in lines.
+        title: An optional title for the histogram.
+
+    Returns:
+        A list of strings representing the histogram, ready for printing.
+    """
+    if not data:
+        return ["(No data available for histogram)"]
+
+    # 1. Sanitize the input data
+    clean_data = [float(val) for val in data if isinstance(val, (int, float)) and math.isfinite(val)]
+
+    if not clean_data:
+        return ["(No valid numerical data to plot)"]
+
+    min_val, max_val = min(clean_data), max(clean_data)
+
+    if min_val == max_val:
+        return [f"(All values are identical: {_format_number(min_val)})"]
+
+    # 2. Create bins and count frequencies
+    # Add a small epsilon to the range to ensure max_val falls into the last bin
+    epsilon = (max_val - min_val) / 1e9
+    value_range = (max_val - min_val) + epsilon
+    bin_width = value_range / bins
+
+    bin_counts = [0] * bins
+    for value in clean_data:
+        bin_index = int((value - min_val) / bin_width)
+        bin_counts[bin_index] += 1
+
+    # 3. Render the histogram
+    return _render_histogram(
+        bin_counts=bin_counts,
+        min_val=min_val,
+        max_val=max_val,
+        width=width,
+        height=height,
+        title=title
+    )
+
+
+def _render_histogram(
+    bin_counts: List[int],
+    min_val: float,
+    max_val: float,
+    width: int,
+    height: int,
+    title: Optional[str]
+) -> List[str]:
+    """
+    Internal function to render the histogram components into ASCII art.
+    """
+    lines = []
+    if title:
+        lines.append(title.center(width))
+
+    max_count = max(bin_counts) if bin_counts else 0
+    if max_count == 0:
+        return lines + ["(No data falls within histogram bins)"]
+
+    # --- Layout Calculations ---
+    y_axis_width = len(str(max_count))
+    plot_width = width - y_axis_width - 3  # Reserve space for "| " and axis
+    if plot_width <= 0:
+        return ["(Terminal width too narrow to draw histogram)"]
+
+    # Resample the data bins to fit the available plot_width.
+    # This stretches or shrinks the histogram to match the screen space.
+    display_bins = []
+    num_data_bins = len(bin_counts)
+    for i in range(plot_width):
+        # Find the corresponding data bin for this screen column
+        data_bin_index = int(i * num_data_bins / plot_width)
+        display_bins.append(bin_counts[data_bin_index])
+
+    # --- Y-Axis and Bars (Top to Bottom) ---
+    for row in range(height, -1, -1):
+        line = ""
+        # Y-axis labels
+        if row == height:
+            line += f"{max_count:<{y_axis_width}} | "
+        elif row == 0:
+            line += f"{0:<{y_axis_width}} +-"
+        else:
+            line += " " * y_axis_width + " | "
+
+        # Bars - now iterate over the resampled display_bins
+        for count in display_bins:
+            # Scale current count to the available height
+            scaled_height = (count / max_count) * height
+
+            # Determine character based on height relative to current row
+            if scaled_height >= row:
+                line += TICK_CHARS[-1]  # Full block for the solid part of the bar
+            elif scaled_height > row - 1:
+                # This is the top of the bar, use a partial character
+                partial_index = int((scaled_height - row + 1) * (len(TICK_CHARS) - 1))
+                line += TICK_CHARS[max(0, partial_index)]
+            elif row == 0:
+                line += "-"  # X-axis line
+            else:
+                line += " "  # Empty space above the bar
+
+        lines.append(line)
+
+    # --- X-Axis Labels ---
+    x_axis_labels = _create_x_axis_labels(min_val, max_val, plot_width)
+    label_line = " " * (y_axis_width + 3) + x_axis_labels
+    lines.append(label_line)
+
+    return lines
+
+
+def _create_x_axis_labels(min_val: float, max_val: float, plot_width: int) -> str:
+    """Create a formatted string for the X-axis labels."""
+    min_label = _format_number(min_val)
+    max_label = _format_number(max_val)
+
+    available_width = plot_width - len(min_label) - len(max_label)
+
+    if available_width < 4:
+        return f"{min_label}{' ' * (plot_width - len(min_label) - len(max_label))}{max_label}"
+
+    mid_val = (min_val + max_val) / 2
+    mid_label = _format_number(mid_val)
+
+    spacing1 = (plot_width // 2) - len(min_label) - (len(mid_label) // 2)
+    spacing2 = (plot_width - (plot_width // 2)) - (len(mid_label) - (len(mid_label) // 2)) - len(max_label)
+
+    if spacing1 < 1 or spacing2 < 1:
+        return f"{min_label}{' ' * (plot_width - len(min_label) - len(max_label))}{max_label}"
+
+    return f"{min_label}{' ' * spacing1}{mid_label}{' ' * spacing2}{max_label}"
+
+
+def _format_number(value: float) -> str:
+    """Format a number nicely for display on an axis."""
+    if abs(value) < 1e-4 and value != 0:
+        return f"{value:.1e}"
+    if abs(value) >= 1e5:
+        return f"{value:.1e}"
+    if math.isclose(value, int(value)):
+        return str(int(value))
+    if abs(value) < 10:
+        return f"{value:.2f}"
+    if abs(value) < 100:
+        return f"{value:.1f}"
+    return str(int(value))
+
+
+def should_show_histogram(data_type: str, distinct_count: int, total_count: int) -> bool:
+    """
+    Determine if a histogram should be shown for this data.
+    This function uses a set of heuristics to decide if the data is
+    continuous enough to warrant a histogram visualization.
+    """
+    # 1. Type Check: Histograms are only meaningful for numeric data.
+    if 'numeric' not in data_type and 'integer' not in data_type and 'float' not in data_type:
+        return False
+
+    # 2. Data Volume Check: Don't render if there's too little data or no variation.
+    if total_count < 20 or distinct_count <= 1:
+        return False
+
+    # 3. Categorical Data Filter: If the number of distinct values is very low,
+    # treat it as categorical data (e.g., ratings from 1-10, months 1-12).
+    if distinct_count < 15:
+        return False
+
+    # 4. High Cardinality Filter: If almost every value is unique (like an ID or index),
+    # a histogram is not useful as most bars would have a height of 1.
+    distinct_ratio = distinct_count / total_count
+    if distinct_ratio > 0.95:
+        return False
+
+    # 5. Pass: If the data passes all the above filters, it is considered
+    # sufficiently continuous to be visualized with a histogram.
+    return True
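A small smoke test for the two new helpers, using synthetic data (the values are rounded so they repeat; per the thresholds in should_show_histogram, near-unique or near-constant columns are skipped):

import random

from parqv.views.utils import create_text_histogram, should_show_histogram

data = [round(random.gauss(50, 10)) for _ in range(1000)]
if should_show_histogram("numeric", distinct_count=len(set(data)), total_count=len(data)):
    for line in create_text_histogram(data, bins=15, width=50, height=8, title="sample"):
        print(line)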
--- parqv-0.2.1/PKG-INFO
+++ parqv-0.3.0/src/parqv.egg-info/PKG-INFO
@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: parqv
-Version: 0.2.1
+Version: 0.3.0
 Summary: An interactive Python TUI for visualizing, exploring, and analyzing files directly in your terminal.
 Author-email: Sangmin Yoon <sanspareilsmyn@gmail.com>
 License-Expression: Apache-2.0
@@ -23,14 +23,13 @@ Dynamic: license-file
 
 ---
 
-**Supported File Formats:** ✅ **Parquet** | ✅ **JSON** / **JSON Lines (ndjson)** | *(More planned!)*
+**Supported File Formats:** ✅ **Parquet** | ✅ **JSON** / **JSON Lines (ndjson)** | ✅ **CSV / TSV** | *(More planned!)*
 
 ---
 
-**`parqv` is a Python-based interactive TUI (Text User Interface) tool designed to explore, analyze, and understand various data file formats directly within your terminal.**
+**`parqv` is a Python-based interactive TUI (Text User Interface) tool designed to explore, analyze, and understand various data file formats directly within your terminal.** `parqv` aims to provide a unified, visual experience for quick data inspection without leaving your console.
 
 ## 💻 Demo
-
 
 *(Demo shows Parquet features; UI adapts for other formats)*
 
@@ -47,7 +46,7 @@ Dynamic: license-file
 * **🔌 Extensible:** Designed with a handler interface to easily add support for more file formats in the future (like CSV, Arrow IPC, etc.).
 
 ## ✨ Features (TUI Mode)
-* **Multi-Format Support:**
+* **Multi-Format Support:** Now supports **Parquet** (`.parquet`), **JSON/JSON Lines** (`.json`, `.ndjson`), and **CSV/TSV** (`.csv`, `.tsv`). Run `parqv <your_file.{parquet,json,ndjson,csv,tsv}>`.
 * **Metadata Panel:** Displays key file information (path, format, size, total rows, column count, etc.). *Fields may vary slightly depending on the file format.*
 * **Schema Explorer:**
   * Interactive list view of columns.
|
@@ -21,6 +21,7 @@ src/parqv/data_sources/base/__init__.py
|
|
21
21
|
src/parqv/data_sources/base/exceptions.py
|
22
22
|
src/parqv/data_sources/base/handler.py
|
23
23
|
src/parqv/data_sources/formats/__init__.py
|
24
|
+
src/parqv/data_sources/formats/csv.py
|
24
25
|
src/parqv/data_sources/formats/json.py
|
25
26
|
src/parqv/data_sources/formats/parquet.py
|
26
27
|
src/parqv/views/__init__.py
|
@@ -34,4 +35,5 @@ src/parqv/views/components/error_display.py
|
|
34
35
|
src/parqv/views/components/loading_display.py
|
35
36
|
src/parqv/views/utils/__init__.py
|
36
37
|
src/parqv/views/utils/data_formatters.py
|
37
|
-
src/parqv/views/utils/stats_formatters.py
|
38
|
+
src/parqv/views/utils/stats_formatters.py
|
39
|
+
src/parqv/views/utils/visualization.py
|
File without changes
|
File without changes
|
File without changes
|
File without changes
|
File without changes
|
File without changes
|
File without changes
|
File without changes
|
File without changes
|
File without changes
|
File without changes
|
File without changes
|
File without changes
|
File without changes
|
File without changes
|
File without changes
|
File without changes
|
File without changes
|
File without changes
|
File without changes
|
File without changes
|
File without changes
|
File without changes
|
File without changes
|
File without changes
|