parqv 0.2.1__tar.gz → 0.3.0__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (41)
  1. {parqv-0.2.1/src/parqv.egg-info → parqv-0.3.0}/PKG-INFO +4 -5
  2. {parqv-0.2.1 → parqv-0.3.0}/README.md +3 -4
  3. {parqv-0.2.1 → parqv-0.3.0}/pyproject.toml +1 -1
  4. {parqv-0.2.1 → parqv-0.3.0}/src/parqv/core/config.py +2 -1
  5. {parqv-0.2.1 → parqv-0.3.0}/src/parqv/core/handler_factory.py +2 -1
  6. {parqv-0.2.1 → parqv-0.3.0}/src/parqv/data_sources/__init__.py +4 -0
  7. {parqv-0.2.1 → parqv-0.3.0}/src/parqv/data_sources/formats/__init__.py +5 -0
  8. parqv-0.3.0/src/parqv/data_sources/formats/csv.py +460 -0
  9. {parqv-0.2.1 → parqv-0.3.0}/src/parqv/data_sources/formats/json.py +37 -0
  10. {parqv-0.2.1 → parqv-0.3.0}/src/parqv/data_sources/formats/parquet.py +28 -1
  11. {parqv-0.2.1 → parqv-0.3.0}/src/parqv/views/utils/__init__.py +8 -2
  12. {parqv-0.2.1 → parqv-0.3.0}/src/parqv/views/utils/data_formatters.py +23 -1
  13. {parqv-0.2.1 → parqv-0.3.0}/src/parqv/views/utils/stats_formatters.py +78 -18
  14. parqv-0.3.0/src/parqv/views/utils/visualization.py +204 -0
  15. {parqv-0.2.1 → parqv-0.3.0/src/parqv.egg-info}/PKG-INFO +4 -5
  16. {parqv-0.2.1 → parqv-0.3.0}/src/parqv.egg-info/SOURCES.txt +3 -1
  17. {parqv-0.2.1 → parqv-0.3.0}/LICENSE +0 -0
  18. {parqv-0.2.1 → parqv-0.3.0}/setup.cfg +0 -0
  19. {parqv-0.2.1 → parqv-0.3.0}/src/parqv/__init__.py +0 -0
  20. {parqv-0.2.1 → parqv-0.3.0}/src/parqv/app.py +0 -0
  21. {parqv-0.2.1 → parqv-0.3.0}/src/parqv/cli.py +0 -0
  22. {parqv-0.2.1 → parqv-0.3.0}/src/parqv/core/__init__.py +0 -0
  23. {parqv-0.2.1 → parqv-0.3.0}/src/parqv/core/file_utils.py +0 -0
  24. {parqv-0.2.1 → parqv-0.3.0}/src/parqv/core/logging.py +0 -0
  25. {parqv-0.2.1 → parqv-0.3.0}/src/parqv/data_sources/base/__init__.py +0 -0
  26. {parqv-0.2.1 → parqv-0.3.0}/src/parqv/data_sources/base/exceptions.py +0 -0
  27. {parqv-0.2.1 → parqv-0.3.0}/src/parqv/data_sources/base/handler.py +0 -0
  28. {parqv-0.2.1 → parqv-0.3.0}/src/parqv/parqv.css +0 -0
  29. {parqv-0.2.1 → parqv-0.3.0}/src/parqv/views/__init__.py +0 -0
  30. {parqv-0.2.1 → parqv-0.3.0}/src/parqv/views/base.py +0 -0
  31. {parqv-0.2.1 → parqv-0.3.0}/src/parqv/views/components/__init__.py +0 -0
  32. {parqv-0.2.1 → parqv-0.3.0}/src/parqv/views/components/enhanced_data_table.py +0 -0
  33. {parqv-0.2.1 → parqv-0.3.0}/src/parqv/views/components/error_display.py +0 -0
  34. {parqv-0.2.1 → parqv-0.3.0}/src/parqv/views/components/loading_display.py +0 -0
  35. {parqv-0.2.1 → parqv-0.3.0}/src/parqv/views/data_view.py +0 -0
  36. {parqv-0.2.1 → parqv-0.3.0}/src/parqv/views/metadata_view.py +0 -0
  37. {parqv-0.2.1 → parqv-0.3.0}/src/parqv/views/schema_view.py +0 -0
  38. {parqv-0.2.1 → parqv-0.3.0}/src/parqv.egg-info/dependency_links.txt +0 -0
  39. {parqv-0.2.1 → parqv-0.3.0}/src/parqv.egg-info/entry_points.txt +0 -0
  40. {parqv-0.2.1 → parqv-0.3.0}/src/parqv.egg-info/requires.txt +0 -0
  41. {parqv-0.2.1 → parqv-0.3.0}/src/parqv.egg-info/top_level.txt +0 -0
{parqv-0.2.1/src/parqv.egg-info → parqv-0.3.0}/PKG-INFO

@@ -1,6 +1,6 @@
  Metadata-Version: 2.4
  Name: parqv
- Version: 0.2.1
+ Version: 0.3.0
  Summary: An interactive Python TUI for visualizing, exploring, and analyzing files directly in your terminal.
  Author-email: Sangmin Yoon <sanspareilsmyn@gmail.com>
  License-Expression: Apache-2.0
@@ -23,14 +23,13 @@ Dynamic: license-file

  ---

- **Supported File Formats:** ✅ **Parquet** | ✅ **JSON** / **JSON Lines (ndjson)** | *(More planned!)*
+ **Supported File Formats:** ✅ **Parquet** | ✅ **JSON** / **JSON Lines (ndjson)** | ✅ **CSV / TSV** | *(More planned!)*

  ---

- **`parqv` is a Python-based interactive TUI (Text User Interface) tool designed to explore, analyze, and understand various data file formats directly within your terminal.** Initially supporting Parquet and JSON, `parqv` aims to provide a unified, visual experience for quick data inspection without leaving your console.
+ **`parqv` is a Python-based interactive TUI (Text User Interface) tool designed to explore, analyze, and understand various data file formats directly within your terminal.** `parqv` aims to provide a unified, visual experience for quick data inspection without leaving your console.

  ## 💻 Demo
-
  ![parqv.gif](assets/parqv.gif)
  *(Demo shows Parquet features; UI adapts for other formats)*

@@ -47,7 +46,7 @@ Dynamic: license-file
  * **🔌 Extensible:** Designed with a handler interface to easily add support for more file formats in the future (like CSV, Arrow IPC, etc.).

  ## ✨ Features (TUI Mode)
- * **Multi-Format Support:** Currently supports **Parquet** (`.parquet`) and **JSON/JSON Lines** (`.json`, `.ndjson`). Run `parqv <your_file.{parquet,json,ndjson}>`.
+ * **Multi-Format Support:** Now supports **Parquet** (`.parquet`), **JSON/JSON Lines** (`.json`, `.ndjson`), and **CSV/TSV** (`.csv`, `.tsv`). Run `parqv <your_file.{parquet,json,ndjson,csv,tsv}>`.
  * **Metadata Panel:** Displays key file information (path, format, size, total rows, column count, etc.). *Fields may vary slightly depending on the file format.*
  * **Schema Explorer:**
    * Interactive list view of columns.
{parqv-0.2.1 → parqv-0.3.0}/README.md

@@ -7,14 +7,13 @@

  ---

- **Supported File Formats:** ✅ **Parquet** | ✅ **JSON** / **JSON Lines (ndjson)** | *(More planned!)*
+ **Supported File Formats:** ✅ **Parquet** | ✅ **JSON** / **JSON Lines (ndjson)** | ✅ **CSV / TSV** | *(More planned!)*

  ---

- **`parqv` is a Python-based interactive TUI (Text User Interface) tool designed to explore, analyze, and understand various data file formats directly within your terminal.** Initially supporting Parquet and JSON, `parqv` aims to provide a unified, visual experience for quick data inspection without leaving your console.
+ **`parqv` is a Python-based interactive TUI (Text User Interface) tool designed to explore, analyze, and understand various data file formats directly within your terminal.** `parqv` aims to provide a unified, visual experience for quick data inspection without leaving your console.

  ## 💻 Demo
-
  ![parqv.gif](assets/parqv.gif)
  *(Demo shows Parquet features; UI adapts for other formats)*

@@ -31,7 +30,7 @@
  * **🔌 Extensible:** Designed with a handler interface to easily add support for more file formats in the future (like CSV, Arrow IPC, etc.).

  ## ✨ Features (TUI Mode)
- * **Multi-Format Support:** Currently supports **Parquet** (`.parquet`) and **JSON/JSON Lines** (`.json`, `.ndjson`). Run `parqv <your_file.{parquet,json,ndjson}>`.
+ * **Multi-Format Support:** Now supports **Parquet** (`.parquet`), **JSON/JSON Lines** (`.json`, `.ndjson`), and **CSV/TSV** (`.csv`, `.tsv`). Run `parqv <your_file.{parquet,json,ndjson,csv,tsv}>`.
  * **Metadata Panel:** Displays key file information (path, format, size, total rows, column count, etc.). *Fields may vary slightly depending on the file format.*
  * **Schema Explorer:**
    * Interactive list view of columns.
{parqv-0.2.1 → parqv-0.3.0}/pyproject.toml

@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"

  [project]
  name = "parqv"
- version = "0.2.1"
+ version = "0.3.0"
  description = "An interactive Python TUI for visualizing, exploring, and analyzing files directly in your terminal."
  readme = "README.md"
  requires-python = ">=3.10"
{parqv-0.2.1 → parqv-0.3.0}/src/parqv/core/config.py

@@ -9,7 +9,8 @@ from pathlib import Path
  SUPPORTED_EXTENSIONS: Dict[str, str] = {
      ".parquet": "parquet",
      ".json": "json",
-     ".ndjson": "json"
+     ".ndjson": "json",
+     ".csv": "csv"
  }

  # Application constants
{parqv-0.2.1 → parqv-0.3.0}/src/parqv/core/handler_factory.py

@@ -5,7 +5,7 @@ Handler factory for creating appropriate data handlers based on file type.
  from pathlib import Path
  from typing import Optional

- from ..data_sources import DataHandler, DataHandlerError, ParquetHandler, JsonHandler
+ from ..data_sources import DataHandler, DataHandlerError, ParquetHandler, JsonHandler, CsvHandler
  from .logging import get_logger

  log = get_logger(__name__)
@@ -23,6 +23,7 @@ class HandlerFactory:
      _HANDLER_REGISTRY = {
          "parquet": ParquetHandler,
          "json": JsonHandler,
+         "csv": CsvHandler,
      }

      @classmethod
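Taken together, the config.py and handler_factory.py hunks are the whole wiring for the new format: a file extension resolves to a format key, and the format key resolves to a handler class. A minimal sketch of that two-step lookup, using stand-in handler classes and a hypothetical `resolve_handler` helper (the factory's real entry point is not part of this diff):

```python
# Sketch only: stand-in classes and a hypothetical resolve_handler(), kept
# self-contained; in parqv the registry values are the real handler classes.
from pathlib import Path

class ParquetHandler: ...
class JsonHandler: ...
class CsvHandler: ...

SUPPORTED_EXTENSIONS = {".parquet": "parquet", ".json": "json", ".ndjson": "json", ".csv": "csv"}
HANDLER_REGISTRY = {"parquet": ParquetHandler, "json": JsonHandler, "csv": CsvHandler}

def resolve_handler(path: Path) -> type:
    """Extension -> format key -> handler class, mirroring the two dicts above."""
    format_key = SUPPORTED_EXTENSIONS.get(path.suffix.lower())
    if format_key is None:
        raise ValueError(f"Unsupported file extension: {path.suffix}")
    return HANDLER_REGISTRY[format_key]

print(resolve_handler(Path("data.csv")).__name__)  # -> CsvHandler
```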
{parqv-0.2.1 → parqv-0.3.0}/src/parqv/data_sources/__init__.py

@@ -23,6 +23,8 @@ from .formats import (
      ParquetHandlerError,
      JsonHandler,
      JsonHandlerError,
+     CsvHandler,
+     CsvHandlerError,
  )

  __all__ = [
@@ -41,4 +43,6 @@ __all__ = [
      "ParquetHandlerError",
      "JsonHandler",
      "JsonHandlerError",
+     "CsvHandler",
+     "CsvHandlerError",
  ]
{parqv-0.2.1 → parqv-0.3.0}/src/parqv/data_sources/formats/__init__.py

@@ -4,6 +4,7 @@ Format-specific data handlers for parqv.

  from .parquet import ParquetHandler, ParquetHandlerError
  from .json import JsonHandler, JsonHandlerError
+ from .csv import CsvHandler, CsvHandlerError

  __all__ = [
      # Parquet format
@@ -13,4 +14,8 @@ __all__ = [
      # JSON format
      "JsonHandler",
      "JsonHandlerError",
+
+     # CSV format
+     "CsvHandler",
+     "CsvHandlerError",
  ]
parqv-0.3.0/src/parqv/data_sources/formats/csv.py (new file)

@@ -0,0 +1,460 @@
+ """
+ CSV file handler for parqv data sources.
+ """
+
+ from pathlib import Path
+ from typing import Any, Dict, List, Optional
+
+ import pandas as pd
+
+ from ..base import DataHandler, DataHandlerError
+
+
+ class CsvHandlerError(DataHandlerError):
+     """Custom exception for CSV handling errors."""
+     pass
+
+
+ class CsvHandler(DataHandler):
+     """
+     Handles CSV file interactions using pandas.
+
+     Provides methods to access metadata, schema, data preview, and column statistics
+     for CSV files using pandas DataFrame operations.
+     """
+
+     def __init__(self, file_path: Path):
+         """
+         Initialize the CsvHandler by validating the path and reading the CSV file.
+
+         Args:
+             file_path: Path to the CSV file.
+
+         Raises:
+             CsvHandlerError: If the file is not found, not a file, or cannot be read.
+         """
+         super().__init__(file_path)
+         self.df: Optional[pd.DataFrame] = None
+         self._original_dtypes: Optional[Dict[str, str]] = None
+
+         try:
+             # Validate file existence
+             if not self.file_path.is_file():
+                 raise FileNotFoundError(f"CSV file not found or is not a regular file: {self.file_path}")
+
+             # Read the CSV file with pandas
+             self._read_csv_file()
+
+             self.logger.info(f"Successfully initialized CsvHandler for: {self.file_path.name}")
+
+         except FileNotFoundError as fnf_e:
+             self.logger.error(f"File not found during CsvHandler initialization: {fnf_e}")
+             raise CsvHandlerError(str(fnf_e)) from fnf_e
+         except pd.errors.EmptyDataError as empty_e:
+             self.logger.error(f"CSV file is empty: {empty_e}")
+             raise CsvHandlerError(f"CSV file '{self.file_path.name}' is empty") from empty_e
+         except pd.errors.ParserError as parse_e:
+             self.logger.error(f"CSV parsing error: {parse_e}")
+             raise CsvHandlerError(f"Failed to parse CSV file '{self.file_path.name}': {parse_e}") from parse_e
+         except Exception as e:
+             self.logger.exception(f"Unexpected error initializing CsvHandler for {self.file_path.name}")
+             raise CsvHandlerError(f"Failed to initialize CSV handler '{self.file_path.name}': {e}") from e
+
+     def _read_csv_file(self) -> None:
+         """Read the CSV file using pandas with appropriate settings."""
+         try:
+             # Read CSV with automatic type inference
+             self.df = pd.read_csv(
+                 self.file_path,
+                 # Basic settings
+                 encoding='utf-8',
+                 # Handle various separators automatically
+                 sep=None,  # Let pandas auto-detect
+                 engine='python',  # More flexible parsing
+                 # Preserve original string representation for better type info
+                 dtype=str,  # Read everything as string first
+                 na_values=['', 'NULL', 'null', 'None', 'N/A', 'n/a', 'NaN', 'nan'],
+                 keep_default_na=True,
+             )
+
+             # Store original dtypes before conversion
+             self._original_dtypes = {col: 'string' for col in self.df.columns}
+
+             # Try to infer better types
+             self._infer_types()
+
+             self.logger.debug(f"Successfully read CSV with shape: {self.df.shape}")
+
+         except UnicodeDecodeError:
+             # Try with different encodings
+             for encoding in ['latin1', 'cp1252', 'iso-8859-1']:
+                 try:
+                     self.logger.warning(f"Trying encoding: {encoding}")
+                     self.df = pd.read_csv(
+                         self.file_path,
+                         encoding=encoding,
+                         sep=None,
+                         engine='python',
+                         dtype=str,
+                         na_values=['', 'NULL', 'null', 'None', 'N/A', 'n/a', 'NaN', 'nan'],
+                         keep_default_na=True,
+                     )
+                     self._original_dtypes = {col: 'string' for col in self.df.columns}
+                     self._infer_types()
+                     self.logger.info(f"Successfully read CSV with encoding: {encoding}")
+                     break
+                 except UnicodeDecodeError:
+                     continue
+             else:
+                 raise CsvHandlerError(f"Could not decode CSV file with any common encoding")
+
+     def _infer_types(self) -> None:
+         """Infer appropriate data types for columns."""
+         if self.df is None:
+             return
+
+         for col in self.df.columns:
+             # Try to convert to numeric
+             numeric_converted = pd.to_numeric(self.df[col], errors='coerce')
+             if not numeric_converted.isna().all():
+                 # If most values can be converted to numeric, use numeric type
+                 non_na_original = self.df[col].notna().sum()
+                 non_na_converted = numeric_converted.notna().sum()
+
+                 if non_na_converted / max(non_na_original, 1) > 0.8:  # 80% conversion success
+                     self.df[col] = numeric_converted
+                     if (numeric_converted == numeric_converted.astype('Int64', errors='ignore')).all():
+                         self._original_dtypes[col] = 'integer'
+                     else:
+                         self._original_dtypes[col] = 'float'
+                     continue
+
+             # Try to convert to datetime
+             try:
+                 datetime_converted = pd.to_datetime(self.df[col], errors='coerce', infer_datetime_format=True)
+                 if not datetime_converted.isna().all():
+                     non_na_original = self.df[col].notna().sum()
+                     non_na_converted = datetime_converted.notna().sum()
+
+                     if non_na_converted / max(non_na_original, 1) > 0.8:  # 80% conversion success
+                         self.df[col] = datetime_converted
+                         self._original_dtypes[col] = 'datetime'
+                         continue
+             except (ValueError, TypeError):
+                 pass
+
+             # Try to convert to boolean
+             bool_values = self.df[col].str.lower().isin(['true', 'false', 't', 'f', '1', '0', 'yes', 'no', 'y', 'n'])
+             if bool_values.sum() / len(self.df[col]) > 0.8:
+                 bool_mapping = {
+                     'true': True, 'false': False, 't': True, 'f': False,
+                     '1': True, '0': False, 'yes': True, 'no': False,
+                     'y': True, 'n': False
+                 }
+                 self.df[col] = self.df[col].str.lower().map(bool_mapping)
+                 self._original_dtypes[col] = 'boolean'
+                 continue
+
+             # Keep as string
+             self._original_dtypes[col] = 'string'
+
+     def close(self) -> None:
+         """Close and cleanup resources (CSV data is held in memory)."""
+         if self.df is not None:
+             self.logger.info(f"Closed CSV handler for: {self.file_path.name}")
+         self.df = None
+         self._original_dtypes = None
+
+     def get_metadata_summary(self) -> Dict[str, Any]:
+         """
+         Get a summary dictionary of the CSV file's metadata.
+
+         Returns:
+             A dictionary containing metadata like file path, format, row count, columns, size.
+         """
+         if self.df is None:
+             return {"error": "CSV data not loaded or handler closed."}
+
+         try:
+             file_size = self.file_path.stat().st_size
+             size_str = self.format_size(file_size)
+         except Exception as e:
+             self.logger.warning(f"Could not get file size for {self.file_path}: {e}")
+             size_str = "N/A"
+
+         # Create a well-structured metadata summary
+         summary = {
+             "File Information": {
+                 "Path": str(self.file_path),
+                 "Format": "CSV",
+                 "Size": size_str
+             },
+             "Data Structure": {
+                 "Total Rows": f"{len(self.df):,}",
+                 "Total Columns": f"{len(self.df.columns):,}",
+                 "Memory Usage": f"{self.df.memory_usage(deep=True).sum():,} bytes"
+             },
+             "Column Types Summary": self._get_column_types_summary()
+         }
+
+         return summary
+
+     def _get_column_types_summary(self) -> Dict[str, int]:
+         """Get a summary of column types in the CSV data."""
+         if self.df is None or self._original_dtypes is None:
+             return {}
+
+         type_counts = {}
+         for col_type in self._original_dtypes.values():
+             type_counts[col_type] = type_counts.get(col_type, 0) + 1
+
+         # Format for better display
+         formatted_summary = {}
+         type_labels = {
+             'string': 'Text Columns',
+             'integer': 'Integer Columns',
+             'float': 'Numeric Columns',
+             'datetime': 'Date/Time Columns',
+             'boolean': 'Boolean Columns'
+         }
+
+         for type_key, count in type_counts.items():
+             label = type_labels.get(type_key, f'{type_key.title()} Columns')
+             formatted_summary[label] = f"{count:,}"
+
+         return formatted_summary
+
+     def get_schema_data(self) -> Optional[List[Dict[str, Any]]]:
+         """
+         Get the schema of the CSV data.
+
+         Returns:
+             A list of dictionaries describing columns (name, type, nullable),
+             or None if schema couldn't be determined.
+         """
+         if self.df is None:
+             self.logger.warning("DataFrame is not available for schema data")
+             return None
+
+         schema_list = []
+
+         for col in self.df.columns:
+             try:
+                 # Get the inferred type
+                 col_type = self._original_dtypes.get(col, 'string')
+
+                 # Check for null values
+                 has_nulls = self.df[col].isna().any()
+
+                 schema_list.append({
+                     "name": str(col),
+                     "type": col_type,
+                     "nullable": bool(has_nulls)
+                 })
+
+             except Exception as e:
+                 self.logger.error(f"Error processing column '{col}' for schema data: {e}")
+                 schema_list.append({
+                     "name": str(col),
+                     "type": f"[Error: {e}]",
+                     "nullable": None
+                 })
+
+         return schema_list
+
+     def get_data_preview(self, num_rows: int = 50) -> Optional[pd.DataFrame]:
+         """
+         Fetch a preview of the data.
+
+         Args:
+             num_rows: The maximum number of rows to fetch.
+
+         Returns:
+             A pandas DataFrame with preview data, an empty DataFrame if no data,
+             or a DataFrame with an 'error' column on failure.
+         """
+         if self.df is None:
+             self.logger.warning("CSV data not available for preview")
+             return pd.DataFrame({"error": ["CSV data not loaded or handler closed."]})
+
+         try:
+             if self.df.empty:
+                 self.logger.info("CSV file has no data rows")
+                 return pd.DataFrame(columns=self.df.columns)
+
+             # Return first num_rows
+             preview_df = self.df.head(num_rows).copy()
+             self.logger.info(f"Generated preview of {len(preview_df)} rows for {self.file_path.name}")
+             return preview_df
+
+         except Exception as e:
+             self.logger.exception(f"Error generating data preview from CSV file: {self.file_path.name}")
+             return pd.DataFrame({"error": [f"Failed to generate preview: {e}"]})
+
+     def get_column_stats(self, column_name: str) -> Dict[str, Any]:
+         """
+         Calculate and return statistics for a specific column.
+
+         Args:
+             column_name: The name of the column.
+
+         Returns:
+             A dictionary containing column statistics or error information.
+         """
+         if self.df is None:
+             return self._create_stats_result(
+                 column_name, "Unknown", {}, error="CSV data not loaded or handler closed."
+             )
+
+         if column_name not in self.df.columns:
+             return self._create_stats_result(
+                 column_name, "Unknown", {}, error=f"Column '{column_name}' not found in CSV data."
+             )
+
+         try:
+             col_series = self.df[column_name]
+             col_type = self._original_dtypes.get(column_name, 'string')
+
+             # Basic counts
+             total_count = len(col_series)
+             null_count = col_series.isna().sum()
+             valid_count = total_count - null_count
+             null_percentage = (null_count / total_count * 100) if total_count > 0 else 0
+
+             stats = {
+                 "Total Count": f"{total_count:,}",
+                 "Valid Count": f"{valid_count:,}",
+                 "Null Count": f"{null_count:,}",
+                 "Null Percentage": f"{null_percentage:.2f}%"
+             }
+
+             # Type-specific statistics
+             if valid_count > 0:
+                 valid_series = col_series.dropna()
+
+                 # Distinct count (always applicable)
+                 distinct_count = valid_series.nunique()
+                 stats["Distinct Count"] = f"{distinct_count:,}"
+
+                 if col_type in ['integer', 'float']:
+                     # Numeric statistics
+                     stats.update(self._calculate_numeric_stats_pandas(valid_series))
+                 elif col_type == 'datetime':
+                     # Datetime statistics
+                     stats.update(self._calculate_datetime_stats_pandas(valid_series))
+                 elif col_type == 'boolean':
+                     # Boolean statistics
+                     stats.update(self._calculate_boolean_stats_pandas(valid_series))
+                 elif col_type == 'string':
+                     # String statistics (min/max by alphabetical order)
+                     stats.update(self._calculate_string_stats_pandas(valid_series))
+
+             return self._create_stats_result(column_name, col_type, stats, nullable=null_count > 0)
+
+         except Exception as e:
+             self.logger.exception(f"Error calculating stats for column '{column_name}'")
+             return self._create_stats_result(
+                 column_name, "Unknown", {}, error=f"Failed to calculate statistics: {e}"
+             )
+
+     def _calculate_numeric_stats_pandas(self, series: pd.Series) -> Dict[str, Any]:
+         """Calculate statistics for numeric columns using pandas."""
+         stats = {}
+         try:
+             stats["Min"] = series.min()
+             stats["Max"] = series.max()
+             stats["Mean"] = f"{series.mean():.4f}"
+             stats["Median (50%)"] = series.median()
+             stats["StdDev"] = f"{series.std():.4f}"
+
+             # Add histogram data for visualization
+             try:
+                 # Sample data if too large for performance
+                 sample_size = min(10000, len(series))
+                 if len(series) > sample_size:
+                     sampled_series = series.sample(n=sample_size, random_state=42)
+                 else:
+                     sampled_series = series
+
+                 # Convert to list for histogram
+                 clean_data = sampled_series.tolist()
+
+                 if len(clean_data) > 10:  # Only create histogram if we have enough data
+                     stats["_histogram_data"] = clean_data
+                     stats["_data_type"] = "numeric"
+
+             except Exception as e:
+                 self.logger.warning(f"Failed to prepare histogram data: {e}")
+
+         except Exception as e:
+             self.logger.warning(f"Error calculating numeric stats: {e}")
+             stats["Calculation Error"] = str(e)
+         return stats
+
+     def _calculate_datetime_stats_pandas(self, series: pd.Series) -> Dict[str, Any]:
+         """Calculate statistics for datetime columns using pandas."""
+         stats = {}
+         try:
+             stats["Min"] = series.min()
+             stats["Max"] = series.max()
+             # Calculate time range
+             time_range = series.max() - series.min()
+             stats["Range"] = str(time_range)
+         except Exception as e:
+             self.logger.warning(f"Error calculating datetime stats: {e}")
+             stats["Calculation Error"] = str(e)
+         return stats
+
+     def _calculate_boolean_stats_pandas(self, series: pd.Series) -> Dict[str, Any]:
+         """Calculate statistics for boolean columns using pandas."""
+         stats = {}
+         try:
+             value_counts = series.value_counts()
+             stats["True Count"] = f"{value_counts.get(True, 0):,}"
+             stats["False Count"] = f"{value_counts.get(False, 0):,}"
+             if len(value_counts) > 0:
+                 true_pct = (value_counts.get(True, 0) / len(series) * 100)
+                 stats["True Percentage"] = f"{true_pct:.2f}%"
+         except Exception as e:
+             self.logger.warning(f"Error calculating boolean stats: {e}")
+             stats["Calculation Error"] = str(e)
+         return stats
+
+     def _calculate_string_stats_pandas(self, series: pd.Series) -> Dict[str, Any]:
+         """Calculate statistics for string columns using pandas."""
+         stats = {}
+         try:
+             # Only min/max for strings (alphabetical order)
+             stats["Min"] = str(series.min())
+             stats["Max"] = str(series.max())
+
+             # Most common values
+             value_counts = series.value_counts().head(5)
+             if len(value_counts) > 0:
+                 top_values = {}
+                 for value, count in value_counts.items():
+                     top_values[str(value)] = f"{count:,}"
+                 stats["Top Values"] = top_values
+         except Exception as e:
+             self.logger.warning(f"Error calculating string stats: {e}")
+             stats["Calculation Error"] = str(e)
+         return stats
+
+     def _create_stats_result(
+         self,
+         column_name: str,
+         col_type: str,
+         calculated_stats: Dict[str, Any],
+         nullable: Optional[bool] = None,
+         error: Optional[str] = None,
+         message: Optional[str] = None
+     ) -> Dict[str, Any]:
+         """Package the stats results consistently."""
+         return {
+             "column": column_name,
+             "type": col_type,
+             "nullable": nullable if nullable is not None else "Unknown",
+             "calculated": calculated_stats or {},
+             "error": error,
+             "message": message,
+         }
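Since CsvHandler implements the same DataHandler surface as the Parquet and JSON handlers, the rest of the app can drive it uniformly. A usage sketch based only on the methods defined above; the file name and column name are placeholders:

```python
# Usage sketch for the new handler; "sales.csv" and "amount" are placeholders.
# CsvHandler/CsvHandlerError are re-exported from parqv.data_sources (see the
# __init__.py hunks above).
from pathlib import Path

from parqv.data_sources import CsvHandler, CsvHandlerError

try:
    handler = CsvHandler(Path("sales.csv"))
    print(handler.get_metadata_summary())         # file info, data structure, column type summary
    for col in handler.get_schema_data() or []:
        print(col["name"], col["type"], col["nullable"])
    print(handler.get_data_preview(num_rows=10))  # pandas DataFrame of the first rows
    print(handler.get_column_stats("amount"))     # counts plus type-specific stats
    handler.close()
except CsvHandlerError as e:
    print(f"Could not open CSV: {e}")
```

Note the design trade-off visible in `_read_csv_file` and `_infer_types`: the whole file is read into memory as strings, and a column is promoted to integer/float/datetime/boolean only when more than 80% of its values convert cleanly; everything else stays a string.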
{parqv-0.2.1 → parqv-0.3.0}/src/parqv/data_sources/formats/json.py

@@ -303,6 +303,12 @@ class JsonHandler(DataHandler):
              # SUMMARIZE puts results in the first row
              stats = self._format_summarize_stats(summarize_df.iloc[0])

+             # Add histogram data for numeric columns
+             try:
+                 self._add_histogram_data_if_numeric(stats, safe_column_name)
+             except Exception as hist_e:
+                 self.logger.warning(f"Failed to add histogram data for {column_name}: {hist_e}")
+
          except duckdb.Error as db_err:
              self.logger.exception(f"DuckDB Error calculating statistics for column '{column_name}': {db_err}")
              error_msg = f"DuckDB calculation failed: {db_err}"
@@ -403,6 +409,37 @@ class JsonHandler(DataHandler):

          return stats

+     def _add_histogram_data_if_numeric(self, stats: Dict[str, Any], safe_column_name: str) -> None:
+         """Add histogram data for numeric columns by sampling from DuckDB."""
+         # Check if this looks like numeric data (has Mean, Min, Max)
+         if not all(key in stats for key in ["Mean", "Min", "Max"]):
+             return
+
+         try:
+             # Sample data for histogram (limit to 10k samples for performance)
+             sample_query = f"""
+                 SELECT {safe_column_name}
+                 FROM "{self._view_name}"
+                 WHERE {safe_column_name} IS NOT NULL
+                 USING SAMPLE 10000
+             """
+
+             sample_df = self._db_conn.sql(sample_query).df()
+
+             if not sample_df.empty and len(sample_df) > 10:
+                 # Extract the column data
+                 column_data = sample_df.iloc[:, 0].tolist()
+
+                 # Filter out any remaining nulls
+                 clean_data = [val for val in column_data if val is not None]
+
+                 if len(clean_data) > 10:
+                     stats["_histogram_data"] = clean_data
+                     stats["_data_type"] = "numeric"
+
+         except Exception as e:
+             self.logger.warning(f"Failed to sample data for histogram: {e}")
+
      def _create_stats_result(
          self,
          column_name: str,
{parqv-0.2.1 → parqv-0.3.0}/src/parqv/data_sources/formats/parquet.py

@@ -301,7 +301,8 @@ class ParquetHandler(DataHandler):
              message = f"Statistics calculation not implemented for type '{self._format_pyarrow_type(col_type)}'."

          except Exception as calc_err:
-             self.logger.exception(f"Error during type-specific calculation for column '{column_name}': {calc_err}")
+             self.logger.exception(
+                 f"Error during type-specific calculation for column '{column_name}': {calc_err}")
              error_msg = f"Calculation error for type {field.type}: {calc_err}"
              calculated_stats["Calculation Error"] = str(calc_err)  # Add specific error key

@@ -422,6 +423,32 @@ class ParquetHandler(DataHandler):
          variance_val, err_var = self._safe_compute(pc.variance, column_data, ddof=1)
          stats["Variance"] = f"{variance_val:.4f}" if variance_val is not None and err_var is None else (
              err_var or "N/A")
+         distinct_val, err = self._safe_compute(pc.count_distinct, column_data)
+         stats["Distinct Count"] = f"{distinct_val:,}" if distinct_val is not None and err is None else (err or "N/A")
+
+         # Add histogram data for visualization
+         try:
+             # Convert to Python list for histogram calculation (sample if too large)
+             data_length = len(column_data)
+             sample_size = min(10000, data_length)  # Limit to 10k samples for performance
+
+             if data_length > sample_size:
+                 # Sample the data
+                 import random
+                 indices = sorted(random.sample(range(data_length), sample_size))
+                 sampled_data = [column_data[i].as_py() for i in indices]
+             else:
+                 sampled_data = column_data.to_pylist()
+
+             # Filter out None values
+             clean_data = [val for val in sampled_data if val is not None]
+
+             if len(clean_data) > 10:  # Only create histogram if we have enough data
+                 stats["_histogram_data"] = clean_data
+                 stats["_data_type"] = "numeric"
+
+         except Exception as e:
+             self.logger.warning(f"Failed to prepare histogram data: {e}")

          return stats

{parqv-0.2.1 → parqv-0.3.0}/src/parqv/views/utils/__init__.py

@@ -4,10 +4,16 @@ Utility functions for parqv views.

  from .data_formatters import format_metadata_for_display, format_value_for_display
  from .stats_formatters import format_stats_for_display, format_column_info
+ from .visualization import create_text_histogram, should_show_histogram

  __all__ = [
+     # Data formatting
      "format_metadata_for_display",
-     "format_value_for_display",
+     "format_value_for_display",
      "format_stats_for_display",
      "format_column_info",
- ]
+
+     # Visualization
+     "create_text_histogram",
+     "should_show_histogram",
+ ]
{parqv-0.2.1 → parqv-0.3.0}/src/parqv/views/utils/data_formatters.py

@@ -28,15 +28,21 @@ def format_metadata_for_display(metadata: Dict[str, Any]) -> Dict[str, Any]:
      # Format specific known fields with better presentation
      field_formatters = {
          "File Path": lambda x: str(x),
+         "Path": lambda x: str(x),
          "Format": lambda x: str(x).upper(),
          "Total Rows": lambda x: _format_number(x),
+         "Total Columns": lambda x: _format_number(x),
          "Columns": lambda x: _format_number(x),
          "Size": lambda x: _format_size_if_bytes(x),
+         "Memory Usage": lambda x: _format_size_if_bytes(x),
          "DuckDB View": lambda x: f"`{x}`" if x else "N/A",
      }

      for key, value in metadata.items():
-         if key in field_formatters:
+         if isinstance(value, dict):
+             # Handle nested dictionaries (like grouped metadata)
+             formatted[key] = _format_nested_metadata(value, field_formatters)
+         elif key in field_formatters:
              formatted[key] = field_formatters[key](value)
          else:
              formatted[key] = format_value_for_display(value)
@@ -44,6 +50,22 @@ def format_metadata_for_display(metadata: Dict[str, Any]) -> Dict[str, Any]:
      return formatted


+ def _format_nested_metadata(nested_dict: Dict[str, Any], field_formatters: Dict) -> Dict[str, Any]:
+     """Format nested metadata dictionaries."""
+     formatted_nested = {}
+
+     for key, value in nested_dict.items():
+         if isinstance(value, dict):
+             # Handle further nesting if needed
+             formatted_nested[key] = _format_nested_metadata(value, field_formatters)
+         elif key in field_formatters:
+             formatted_nested[key] = field_formatters[key](value)
+         else:
+             formatted_nested[key] = format_value_for_display(value)
+
+     return formatted_nested
+
+
  def format_value_for_display(value: Any) -> str:
      """
      Format a single value for display in the UI.
{parqv-0.2.1 → parqv-0.3.0}/src/parqv/views/utils/stats_formatters.py

@@ -6,6 +6,8 @@ from typing import Any, Dict, List, Union

  from rich.text import Text

+ from .visualization import create_text_histogram, should_show_histogram
+

  def format_stats_for_display(stats_data: Dict[str, Any]) -> List[Union[str, Text]]:
      """
@@ -21,7 +23,7 @@ def format_stats_for_display(stats_data: Dict[str, Any]) -> List[Union[str, Text
          return [Text.from_markup("[red]No statistics data available.[/red]")]

      lines: List[Union[str, Text]] = []
-
+
      # Extract basic column information
      col_name = stats_data.get("column", "N/A")
      col_type = stats_data.get("type", "Unknown")
@@ -29,22 +31,22 @@ def format_stats_for_display(stats_data: Dict[str, Any]) -> List[Union[str, Text

      # Format column header
      lines.extend(_format_column_header(col_name, col_type, nullable_val))
-
+
      # Handle calculation errors
      calc_error = stats_data.get("error")
      if calc_error:
          lines.extend(_format_error_section(calc_error))
-
+
      # Add informational messages
      message = stats_data.get("message")
      if message:
          lines.extend(_format_message_section(message))
-
+
      # Format calculated statistics
      calculated = stats_data.get("calculated")
      if calculated:
          lines.extend(_format_calculated_stats(calculated, has_error=bool(calc_error)))
-
+
      return lines


@@ -72,13 +74,13 @@ def _format_column_header(col_name: str, col_type: str, nullable_val: Any) -> Li
          nullable_str = "Required"
      else:
          nullable_str = "Unknown Nullability"
-
+
      lines = [
          Text.assemble(("Column: ", "bold"), f"`{col_name}`"),
          Text.assemble(("Type: ", "bold"), f"{col_type} ({nullable_str})"),
          "─" * (len(col_name) + len(col_type) + 20)
      ]
-
+
      return lines


@@ -102,7 +104,7 @@ def _format_message_section(message: str) -> List[Union[str, Text]]:
  def _format_calculated_stats(calculated: Dict[str, Any], has_error: bool = False) -> List[Union[str, Text]]:
      """Format the calculated statistics section."""
      lines = [Text("Calculated Statistics:", style="bold")]
-
+
      # Define the order of statistics to display
      stats_order = [
          "Total Count", "Valid Count", "Null Count", "Null Percentage",
@@ -111,32 +113,37 @@ def _format_calculated_stats(calculated: Dict[str, Any], has_error: bool = False
          "True Count", "False Count",
          "Value Counts"
      ]
-
+
      found_stats = False
-
+
      for key in stats_order:
          if key in calculated:
              found_stats = True
              value = calculated[key]
              lines.extend(_format_single_stat(key, value))
-
-     # Add any additional stats not in the predefined order
+
+     # Add any additional stats not in the predefined order (excluding internal histogram data)
      for key, value in calculated.items():
-         if key not in stats_order:
+         if key not in stats_order and not key.startswith('_'):  # Skip internal fields
              found_stats = True
              lines.extend(_format_single_stat(key, value))
-
+
      # Handle case where no stats were found
      if not found_stats and not has_error:
          lines.append(Text(" (No specific stats calculated for this type)", style="dim"))
-
+
+     # Add histogram visualization for numeric data
+     if "_histogram_data" in calculated and "_data_type" in calculated:
+         if calculated["_data_type"] == "numeric":
+             lines.extend(_format_histogram_visualization(calculated))
+
      return lines


  def _format_single_stat(key: str, value: Any) -> List[Union[str, Text]]:
      """Format a single statistic entry."""
      lines = []
-
+
      if key == "Value Counts" and isinstance(value, dict):
          lines.append(f" - {key}:")
          for sub_key, sub_val in value.items():
@@ -145,7 +152,7 @@ def _format_single_stat(key: str, value: Any) -> List[Union[str, Text]]:
      else:
          formatted_value = _format_stat_value(value)
          lines.append(f" - {key}: {formatted_value}")
-
+
      return lines


@@ -157,4 +164,57 @@ def _format_stat_value(value: Any) -> str:
          else:
              return f"{value:,.4f}"
      else:
-         return str(value)
+         return str(value)
+
+
+ def _format_histogram_visualization(calculated: Dict[str, Any]) -> List[Union[str, Text]]:
+     """Format histogram visualization for numeric data."""
+     lines = []
+
+     try:
+         histogram_data = calculated.get("_histogram_data", [])
+         if not histogram_data:
+             return lines
+
+         # Check if we should show histogram
+         distinct_count_str = calculated.get("Distinct Count", "0")
+         try:
+             # Remove commas and convert to int
+             distinct_count = int(distinct_count_str.replace(",", ""))
+         except (ValueError, AttributeError):
+             distinct_count = len(set(histogram_data))
+
+         total_count = len(histogram_data)
+
+         if should_show_histogram("numeric", distinct_count, total_count):
+             lines.append("")
+             lines.append(Text("Data Distribution:", style="bold cyan"))
+
+             # Create histogram
+             histogram_lines = create_text_histogram(
+                 data=histogram_data,
+                 bins=15,
+                 width=50,
+                 height=8,
+                 title=None
+             )
+
+             # Add each histogram line
+             for line in histogram_lines:
+                 if isinstance(line, str):
+                     lines.append(f" {line}")
+                 else:
+                     lines.append(line)
+         else:
+             # For discrete data, show a note
+             if distinct_count < total_count * 0.1:  # Less than 10% unique values
+                 lines.append("")
+                 lines.append(Text("Note: Data appears to be discrete/categorical", style="dim italic"))
+                 lines.append(Text("(Histogram not shown for discrete values)", style="dim italic"))
+
+     except Exception as e:
+         # Don't fail the whole stats display if histogram fails
+         lines.append("")
+         lines.append(Text(f"Note: Could not generate histogram: {e}", style="dim red"))
+
+     return lines
parqv-0.3.0/src/parqv/views/utils/visualization.py (new file)

@@ -0,0 +1,204 @@
+ """
+ Visualization utilities for parqv views.
+
+ Provides text-based data visualization functions like ASCII histograms.
+ """
+ import math
+ from typing import List, Union, Optional
+
+ TICK_CHARS = [' ', '▂', '▃', '▄', '▅', '▆', '▇', '█']
+
+
+ def create_text_histogram(
+     data: List[Union[int, float]],
+     bins: int = 15,
+     width: int = 60,
+     height: int = 8,
+     title: Optional[str] = None
+ ) -> List[str]:
+     """
+     Create a professional, text-based histogram from numerical data.
+
+     Args:
+         data: List of numerical values.
+         bins: The number of bins for the histogram.
+         width: The total character width of the output histogram.
+         height: The maximum height of the histogram bars in lines.
+         title: An optional title for the histogram.
+
+     Returns:
+         A list of strings representing the histogram, ready for printing.
+     """
+     if not data:
+         return ["(No data available for histogram)"]
+
+     # 1. Sanitize the input data
+     clean_data = [float(val) for val in data if isinstance(val, (int, float)) and math.isfinite(val)]
+
+     if not clean_data:
+         return ["(No valid numerical data to plot)"]
+
+     min_val, max_val = min(clean_data), max(clean_data)
+
+     if min_val == max_val:
+         return [f"(All values are identical: {_format_number(min_val)})"]
+
+     # 2. Create bins and count frequencies
+     # Add a small epsilon to the range to ensure max_val falls into the last bin
+     epsilon = (max_val - min_val) / 1e9
+     value_range = (max_val - min_val) + epsilon
+     bin_width = value_range / bins
+
+     bin_counts = [0] * bins
+     for value in clean_data:
+         bin_index = int((value - min_val) / bin_width)
+         bin_counts[bin_index] += 1
+
+     # 3. Render the histogram
+     return _render_histogram(
+         bin_counts=bin_counts,
+         min_val=min_val,
+         max_val=max_val,
+         width=width,
+         height=height,
+         title=title
+     )
+
+
+ def _render_histogram(
+     bin_counts: List[int],
+     min_val: float,
+     max_val: float,
+     width: int,
+     height: int,
+     title: Optional[str]
+ ) -> List[str]:
+     """
+     Internal function to render the histogram components into ASCII art.
+     """
+     lines = []
+     if title:
+         lines.append(title.center(width))
+
+     max_count = max(bin_counts) if bin_counts else 0
+     if max_count == 0:
+         return lines + ["(No data falls within histogram bins)"]
+
+     # --- Layout Calculations ---
+     y_axis_width = len(str(max_count))
+     plot_width = width - y_axis_width - 3  # Reserve space for "| " and axis
+     if plot_width <= 0:
+         return ["(Terminal width too narrow to draw histogram)"]
+
+     # Resample the data bins to fit the available plot_width.
+     # This stretches or shrinks the histogram to match the screen space.
+     display_bins = []
+     num_data_bins = len(bin_counts)
+     for i in range(plot_width):
+         # Find the corresponding data bin for this screen column
+         data_bin_index = int(i * num_data_bins / plot_width)
+         display_bins.append(bin_counts[data_bin_index])
+
+     # --- Y-Axis and Bars (Top to Bottom) ---
+     for row in range(height, -1, -1):
+         line = ""
+         # Y-axis labels
+         if row == height:
+             line += f"{max_count:<{y_axis_width}} | "
+         elif row == 0:
+             line += f"{0:<{y_axis_width}} +-"
+         else:
+             line += " " * y_axis_width + " | "
+
+         # Bars - now iterate over the resampled display_bins
+         for count in display_bins:
+             # Scale current count to the available height
+             scaled_height = (count / max_count) * height
+
+             # Determine character based on height relative to current row
+             if scaled_height >= row:
+                 line += TICK_CHARS[-1]  # Full block for the solid part of the bar
+             elif scaled_height > row - 1:
+                 # This is the top of the bar, use a partial character
+                 partial_index = int((scaled_height - row + 1) * (len(TICK_CHARS) - 1))
+                 line += TICK_CHARS[max(0, partial_index)]
+             elif row == 0:
+                 line += "-"  # X-axis line
+             else:
+                 line += " "  # Empty space above the bar
+
+         lines.append(line)
+
+     # --- X-Axis Labels ---
+     x_axis_labels = _create_x_axis_labels(min_val, max_val, plot_width)
+     label_line = " " * (y_axis_width + 3) + x_axis_labels
+     lines.append(label_line)
+
+     return lines
+
+
+ def _create_x_axis_labels(min_val: float, max_val: float, plot_width: int) -> str:
+     """Create a formatted string for the X-axis labels."""
+     min_label = _format_number(min_val)
+     max_label = _format_number(max_val)
+
+     available_width = plot_width - len(min_label) - len(max_label)
+
+     if available_width < 4:
+         return f"{min_label}{' ' * (plot_width - len(min_label) - len(max_label))}{max_label}"
+
+     mid_val = (min_val + max_val) / 2
+     mid_label = _format_number(mid_val)
+
+     spacing1 = (plot_width // 2) - len(min_label) - (len(mid_label) // 2)
+     spacing2 = (plot_width - (plot_width // 2)) - (len(mid_label) - (len(mid_label) // 2)) - len(max_label)
+
+     if spacing1 < 1 or spacing2 < 1:
+         return f"{min_label}{' ' * (plot_width - len(min_label) - len(max_label))}{max_label}"
+
+     return f"{min_label}{' ' * spacing1}{mid_label}{' ' * spacing2}{max_label}"
+
+
+ def _format_number(value: float) -> str:
+     """Format a number nicely for display on an axis."""
+     if abs(value) < 1e-4 and value != 0:
+         return f"{value:.1e}"
+     if abs(value) >= 1e5:
+         return f"{value:.1e}"
+     if math.isclose(value, int(value)):
+         return str(int(value))
+     if abs(value) < 10:
+         return f"{value:.2f}"
+     if abs(value) < 100:
+         return f"{value:.1f}"
+     return str(int(value))
+
+
+ def should_show_histogram(data_type: str, distinct_count: int, total_count: int) -> bool:
+     """
+     Determine if a histogram should be shown for this data.
+     This function uses a set of heuristics to decide if the data is
+     continuous enough to warrant a histogram visualization.
+     """
+     # 1. Type Check: Histograms are only meaningful for numeric data.
+     if 'numeric' not in data_type and 'integer' not in data_type and 'float' not in data_type:
+         return False
+
+     # 2. Data Volume Check: Don't render if there's too little data or no variation.
+     if total_count < 20 or distinct_count <= 1:
+         return False
+
+     # 3. Categorical Data Filter: If the number of distinct values is very low,
+     #    treat it as categorical data (e.g., ratings from 1-10, months 1-12).
+     if distinct_count < 15:
+         return False
+
+     # 4. High Cardinality Filter: If almost every value is unique (like an ID or index),
+     #    a histogram is not useful as most bars would have a height of 1.
+     distinct_ratio = distinct_count / total_count
+     if distinct_ratio > 0.95:
+         return False
+
+     # 5. Pass: If the data passes all the above filters, it is considered
+     #    sufficiently continuous to be visualized with a histogram.
+     return True
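The stats panel drives these helpers with `bins=15, width=50, height=8` (see `_format_histogram_visualization` in stats_formatters.py above). A standalone sketch with synthetic data; the rounding is deliberate, since `should_show_histogram` rejects data whose distinct-to-total ratio exceeds 0.95:

```python
# Synthetic demo of the new helpers; not part of parqv itself. The import
# path matches the re-exports added to parqv.views.utils above.
import random

from parqv.views.utils import create_text_histogram, should_show_histogram

random.seed(0)
data = [round(random.gauss(50.0, 10.0), 1) for _ in range(1_000)]

# Rounding to one decimal keeps distinct/total well under the 0.95 cutoff,
# so the heuristic treats the column as continuous rather than ID-like.
if should_show_histogram("numeric", distinct_count=len(set(data)), total_count=len(data)):
    for line in create_text_histogram(data, bins=15, width=50, height=8):
        print(line)
```

The renderer resamples its data bins across the available plot width and uses the partial block characters in `TICK_CHARS` to draw fractional bar tops, so the output degrades gracefully in narrow terminals.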
{parqv-0.2.1 → parqv-0.3.0/src/parqv.egg-info}/PKG-INFO

@@ -1,6 +1,6 @@
  Metadata-Version: 2.4
  Name: parqv
- Version: 0.2.1
+ Version: 0.3.0
  Summary: An interactive Python TUI for visualizing, exploring, and analyzing files directly in your terminal.
  Author-email: Sangmin Yoon <sanspareilsmyn@gmail.com>
  License-Expression: Apache-2.0
@@ -23,14 +23,13 @@ Dynamic: license-file

  ---

- **Supported File Formats:** ✅ **Parquet** | ✅ **JSON** / **JSON Lines (ndjson)** | *(More planned!)*
+ **Supported File Formats:** ✅ **Parquet** | ✅ **JSON** / **JSON Lines (ndjson)** | ✅ **CSV / TSV** | *(More planned!)*

  ---

- **`parqv` is a Python-based interactive TUI (Text User Interface) tool designed to explore, analyze, and understand various data file formats directly within your terminal.** Initially supporting Parquet and JSON, `parqv` aims to provide a unified, visual experience for quick data inspection without leaving your console.
+ **`parqv` is a Python-based interactive TUI (Text User Interface) tool designed to explore, analyze, and understand various data file formats directly within your terminal.** `parqv` aims to provide a unified, visual experience for quick data inspection without leaving your console.

  ## 💻 Demo
-
  ![parqv.gif](assets/parqv.gif)
  *(Demo shows Parquet features; UI adapts for other formats)*

@@ -47,7 +46,7 @@ Dynamic: license-file
  * **🔌 Extensible:** Designed with a handler interface to easily add support for more file formats in the future (like CSV, Arrow IPC, etc.).

  ## ✨ Features (TUI Mode)
- * **Multi-Format Support:** Currently supports **Parquet** (`.parquet`) and **JSON/JSON Lines** (`.json`, `.ndjson`). Run `parqv <your_file.{parquet,json,ndjson}>`.
+ * **Multi-Format Support:** Now supports **Parquet** (`.parquet`), **JSON/JSON Lines** (`.json`, `.ndjson`), and **CSV/TSV** (`.csv`, `.tsv`). Run `parqv <your_file.{parquet,json,ndjson,csv,tsv}>`.
  * **Metadata Panel:** Displays key file information (path, format, size, total rows, column count, etc.). *Fields may vary slightly depending on the file format.*
  * **Schema Explorer:**
    * Interactive list view of columns.
{parqv-0.2.1 → parqv-0.3.0}/src/parqv.egg-info/SOURCES.txt

@@ -21,6 +21,7 @@ src/parqv/data_sources/base/__init__.py
  src/parqv/data_sources/base/exceptions.py
  src/parqv/data_sources/base/handler.py
  src/parqv/data_sources/formats/__init__.py
+ src/parqv/data_sources/formats/csv.py
  src/parqv/data_sources/formats/json.py
  src/parqv/data_sources/formats/parquet.py
  src/parqv/views/__init__.py
@@ -34,4 +35,5 @@ src/parqv/views/components/error_display.py
  src/parqv/views/components/loading_display.py
  src/parqv/views/utils/__init__.py
  src/parqv/views/utils/data_formatters.py
- src/parqv/views/utils/stats_formatters.py
+ src/parqv/views/utils/stats_formatters.py
+ src/parqv/views/utils/visualization.py