PyPI - csvpredict - Versions diffs - 0.0.1__tar.gz - Mend

csvpredict 0.0.1__tar.gz


Files changed (111)

csvpredict-0.0.1/PKG-INFO ADDED

@@ -0,0 +1,577 @@
+Metadata-Version: 2.3
+Name: csvpredict
+Version: 0.0.1
+Summary: SDK for the CSVPredict API - analyze and visualize CSV data
+Author: Leon David Zipp
+Author-email: Leon David Zipp <leondavidzipp@gmx.de>
+Requires-Dist: attrs>=25.4.0
+Requires-Dist: httpx>=0.28.1
+Requires-Dist: ipython>=9.9.0
+Requires-Dist: numpy>=2.4.1
+Requires-Dist: pandas>=2.3.3
+Requires-Dist: pillow>=12.1.0
+Requires-Dist: polars>=1.37.1
+Requires-Dist: pyarrow>=23.0.0
+Requires-Dist: pydantic>=2.12.5
+Requires-Dist: pyspark>=4.1.1
+Requires-Dist: urllib3>=2.6.3
+Requires-Dist: pandas>=2.0.0 ; extra == 'all'
+Requires-Dist: polars>=0.20.0 ; extra == 'all'
+Requires-Dist: pandas>=2.0.0 ; extra == 'pandas'
+Requires-Dist: polars>=0.20.0 ; extra == 'polars'
+Requires-Python: >=3.13
+Provides-Extra: all
+Provides-Extra: pandas
+Provides-Extra: polars
+Description-Content-Type: text/markdown
+# CSVPredict SDK
+A Python SDK for the CSVPredict API - analyze and visualize your tabular data with ease.
+## Features
+- 📊 **Generate graphs** - Automatically create histograms, correlations, and statistical visualizations
+- 🔍 **Inspect data** - Get comprehensive statistics for numeric, string, datetime, duration, and boolean columns
+- 🐼 **DataFrame support** - Works with pandas, Polars DataFrames, and Polars LazyFrames
+- 📓 **Jupyter integration** - Display graphs directly in notebooks with customizable sizing
+- 🔄 **Multiple output formats** - Export to Polars, pandas, numpy, or raw dictionaries
+- 🎯 **Flexible filtering** - Case-insensitive substring matching for columns, datatypes, and partitions
+## Installation
+```bash
+pip install csvpredict
+```
+Or with uv:
+```bash
+uv add csvpredict
+```
+## Quick Start
+```python
+import polars as pl
+from csvpredict_sdk import CSVPredict
+# Initialize the SDK
+sdk = CSVPredict(base_url="http://localhost:8000")
+# Load your data
+df = pl.read_csv("sales_data.csv")
+# Generate graphs
+graphs = sdk.generate_graphs(df)
+graphs.display()  # Show in Jupyter
+# Get statistics
+result = sdk.inspect(df)
+print(result.stats.overall.height)  # Number of rows
+print(result.stats.summary.numeric["price"])  # Price statistics
+```
+## API Reference
+### Initializing the SDK
+```python
+from csvpredict_sdk import CSVPredict
+sdk = CSVPredict(
+    base_url="http://localhost:8000",      # API server URL
+    frontend_url="http://localhost:3000",  # Frontend URL (for browser inspection)
+    timeout=60.0,                          # Request timeout in seconds
+)
+```
+---
+## Generating Graphs
+Generate statistical visualizations from your data.
+### Basic Usage
+```python
+import polars as pl
+from csvpredict_sdk import CSVPredict
+sdk = CSVPredict()
+df = pl.read_csv("data.csv")
+# Generate graphs with default settings (SVG format)
+graphs = sdk.generate_graphs(df)
+```
+### Options
+```python
+graphs = sdk.generate_graphs(
+    df,
+    partition_by=["category"],     # Generate separate graphs per category
+    extension=".png",              # Output format: .svg, .png, .jpg, .jpeg
+    transparent=True,              # Transparent background
+    window_size=7,                 # Rolling window size for time series
+    null_values=["N/A", "NULL"],   # Strings to treat as null
+    dpi=300,                       # Resolution for raster formats
+    font="Arial",                  # Font family
+    language="en",                 # Language for labels
+)
+```
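To make `window_size` concrete, here is a minimal sketch of a trailing rolling mean in plain Python. It is an illustration only: which rolling statistics the CSVPredict API computes, and how it handles the first incomplete windows, is not stated in this README. `rolling_mean` is a hypothetical helper, not part of the SDK.

```python
from collections import deque


def rolling_mean(values, window_size):
    """Return the mean of every full trailing window of `window_size` values.

    Hypothetical helper for illustration; not part of the CSVPredict SDK.
    """
    window = deque(maxlen=window_size)
    total = 0.0
    means = []
    for value in values:
        if len(window) == window_size:
            total -= window[0]  # oldest value is evicted by the append below
        window.append(value)
        total += value
        if len(window) == window_size:
            means.append(total / window_size)
    return means


daily_sales = [10, 12, 11, 13, 15, 14, 16, 18, 17]
weekly_means = rolling_mean(daily_sales, window_size=7)
print(weekly_means)
```

With nine values and a window of 7, only three full windows exist, so the result has three entries. A larger `window_size` smooths more but yields fewer points.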
+### Working with GraphResult
+The `generate_graphs()` method returns a `GraphResult` object that offers several ways to access your graphs:
+#### Display in Jupyter
+```python
+# Display all graphs in a grid
+graphs.display()
+# Display with custom size
+graphs.display(width=500, height=400)
+# Display in 3 columns
+graphs.display(columns=3)
+# Filter by name (case-insensitive substring match)
+graphs.display("histogram")           # Show all histograms
+graphs.display("price")               # Show price-related graphs
+graphs.display("correlation")         # Show correlation matrices
+# Filter by extension
+graphs.display(extension=".png")
+# Display a single graph
+graphs.display_one("price_histogram.svg", width=600)
+```
+#### Access Individual Graphs
+```python
+# Get list of all graph names
+print(graphs.names)
+# ['price_histogram.svg', 'quantity_histogram.svg', 'correlation_matrix.svg', ...]
+# Get raw bytes for a specific graph
+svg_bytes = graphs["price_histogram.svg"]
+# Check if a graph exists
+if graphs.contains("price_histogram.svg"):
+    print("Found it!")
+# Get with default
+data = graphs.get("missing.svg", default=None)
+# Number of graphs
+print(graphs.count())  # 15
+```
+#### Iterate and Filter
+```python
+# Iterate over all graphs
+for name, data in graphs:
+    print(f"{name}: {len(data)} bytes")
+# Filter by pattern (case-insensitive)
+for name, data in graphs.filter("histogram"):
+    print(name)
+# Filter by extension
+for name, data in graphs.filter(extension=".svg"):
+    print(name)
+```
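The filtering rule described above (case-insensitive substring match on the name, optional match on the extension) can be sketched with a plain dictionary of name-to-bytes pairs. This is an assumption about the behavior, not the SDK's implementation, and `filter_graphs` is a hypothetical stand-in for `graphs.filter()`.

```python
def filter_graphs(graphs, pattern=None, extension=None):
    """Keep entries whose name contains `pattern` and ends with `extension`.

    Both checks ignore case. Hypothetical helper, not the SDK's code.
    """
    selected = {}
    for name, data in graphs.items():
        lowered = name.lower()
        if pattern is not None and pattern.lower() not in lowered:
            continue
        if extension is not None and not lowered.endswith(extension.lower()):
            continue
        selected[name] = data
    return selected


# Stand-in data: graph names mapped to raw bytes.
graphs_by_name = {
    "price_histogram.svg": b"<svg/>",
    "quantity_histogram.png": b"\x89PNG",
    "correlation_matrix.svg": b"<svg/>",
}

print(list(filter_graphs(graphs_by_name, "HISTOGRAM")))
print(list(filter_graphs(graphs_by_name, extension=".svg")))
```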
+#### Convert to PIL Image (Raster Only)
+```python
+# Convert PNG/JPG to PIL Image for further processing
+img = graphs.to_pil("price_histogram.png")
+img.show()
+img.save("modified.png")
+```
+#### Save to Disk
+```python
+# Save as ZIP file
+graphs.save("graphs.zip")
+# Extract all to a directory
+extracted_files = graphs.extract("./output_graphs/")
+print(extracted_files)
+# [Path('output_graphs/price_histogram.svg'), ...]
+```
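If you want to handle the archive yourself, the same save-then-extract round trip can be done with the standard library. The sketch below assumes the graphs are available as a dictionary of file name to bytes. `save_zip` and `extract_zip` are hypothetical helpers, and the layout of the ZIP written by `graphs.save()` may differ.

```python
import tempfile
import zipfile
from pathlib import Path


def save_zip(graphs, zip_path):
    """Write each name/bytes pair as one member of a ZIP archive."""
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for name, data in graphs.items():
            archive.writestr(name, data)


def extract_zip(zip_path, target_dir):
    """Extract all members and return the paths of the extracted files."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target)
        return [target / name for name in archive.namelist()]


# Stand-in data: graph names mapped to raw bytes.
graphs_by_name = {
    "price_histogram.svg": b"<svg/>",
    "correlation_matrix.svg": b"<svg/>",
}

with tempfile.TemporaryDirectory() as tmp:
    zip_path = Path(tmp) / "graphs.zip"
    save_zip(graphs_by_name, zip_path)
    extracted = extract_zip(zip_path, Path(tmp) / "output_graphs")
    # Verify that every extracted file matches the bytes that were saved.
    contents_match = all(
        path.read_bytes() == graphs_by_name[path.name] for path in extracted
    )
    print(sorted(path.name for path in extracted), contents_match)
```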
+---
+## Inspecting Data
+Get comprehensive statistics about your dataset.
+### Basic Usage
+```python
+result = sdk.inspect(df)
+# Access overall statistics
+print(result.stats.overall.height)          # Number of rows
+print(result.stats.overall.width)           # Number of columns
+print(result.stats.overall.null_count)      # Total null values
+print(result.stats.overall.duplicate_count) # Duplicate rows
+print(result.stats.overall.column_names)    # List of columns
+```
+### Options
+```python
+result = sdk.inspect(
+    df,
+    partition_by=["category", "region"],  # Partition data
+    window_size=5,                         # Rolling window size
+    round_digits=3,                        # Decimal precision
+    null_values=["N/A", ""],               # Strings to treat as null
+)
+```
+### Accessing Summary Statistics
+Summary statistics are organized by data type:
+```python
+# Numeric columns
+result.stats.summary.numeric["price"]           # Statistics for 'price' column
+result.stats.summary.numeric_count["price"]     # Value counts for 'price'
+result.stats.summary.numeric_rolling["price"]   # Rolling statistics (time series)
+# String columns
+result.stats.summary.string["product_name"]
+result.stats.summary.string_count["product_name"]
+# Datetime columns
+result.stats.summary.datetime["created_at"]
+result.stats.summary.datetime_count["created_at"]
+# Boolean columns
+result.stats.summary.boolean["is_active"]
+result.stats.summary.boolean_count["is_active"]
+# Duration columns
+result.stats.summary.duration["processing_time"]
+result.stats.summary.duration_count["processing_time"]
+```
+### Filtering Statistics
+All filters use case-insensitive substring matching:
+```python
+# Filter by column name
+result.stats.summary.get_stats_for_column("price")
+# Returns: {'price': DataFrame, 'unit_price': DataFrame, 'total_price': DataFrame}
+# Filter by column name - get value counts
+result.stats.summary.get_counts_for_column("status")
+# Returns: {'status': DataFrame, 'order_status': DataFrame}
+# Filter by column name - get rolling stats
+result.stats.summary.get_rolling_for_column("price")
+# Filter by datatype
+result.stats.summary.filter(datatype="numeric")
+# Returns: {'numeric': {'price': DataFrame, 'quantity': DataFrame, ...}}
+# Filter by both column and datatype
+result.stats.summary.filter(column="price", datatype="count")
+# Returns: {'numeric_count': {'price': DataFrame, 'unit_price': DataFrame}}
+# Access datatype directly
+result.stats.summary["numeric"]        # All numeric statistics
+result.stats.summary["string_count"]   # All string value counts
+```
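The combined column and datatype filter can be pictured as two nested substring checks over a dictionary keyed by datatype and then by column. The sketch below is an assumption about the matching rule, with strings standing in for the DataFrames. `filter_summary` is a hypothetical helper, and the real `filter()` may treat some datatype names differently.

```python
def filter_summary(summary, column=None, datatype=None):
    """Case-insensitive substring filter over {datatype: {column: stats}}.

    Hypothetical helper for illustration; not the SDK's implementation.
    """
    result = {}
    for type_name, columns in summary.items():
        if datatype is not None and datatype.lower() not in type_name.lower():
            continue
        matched = {
            col: stats
            for col, stats in columns.items()
            if column is None or column.lower() in col.lower()
        }
        if matched:  # drop datatypes that have no matching column
            result[type_name] = matched
    return result


# Stand-in structure; the SDK stores DataFrames where this uses strings.
summary = {
    "numeric": {"price": "stats", "unit_price": "stats", "quantity": "stats"},
    "numeric_count": {"price": "counts", "unit_price": "counts"},
    "string_count": {"status": "counts"},
}

print(filter_summary(summary, column="price", datatype="count"))
print(filter_summary(summary, column="STATUS"))
```

Because both checks are substring matches, a short pattern such as `"price"` also selects `unit_price`. Pass a longer pattern when you need a single column.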
+### Correlation Matrix
+```python
+# Get the full correlation matrix
+corr = result.stats.correlation.matrix
+print(corr)
+# Get correlation between two specific columns
+corr_value = result.stats.correlation.get_correlation("price", "quantity")
+print(f"Correlation: {corr_value}")
+# Get column names in the correlation matrix
+print(result.stats.correlation.columns)
+```
+### Working with Partitioned Data
+When you use `partition_by`, data is split into groups:
+```python
+result = sdk.inspect(df, partition_by=["category"])
+# Check if data is partitioned
+print(result.is_partitioned)  # True
+# List available partitions
+print(result.partitions)
+# ['_electronics', '_clothing', '_food']
+# Access a specific partition
+electronics_stats = result["_electronics"]
+print(electronics_stats.overall.height)
+print(electronics_stats.summary.numeric["price"])
+# Iterate over all partitions
+for partition_name, stats in result.items():
+    print(f"{partition_name}: {stats.overall.height} rows")
+# Filter partitions by name
+matching = result.filter_partitions("electronics")
+# Returns: {'_electronics': Statistics(...)}
+# Filter across partitions, columns, and datatypes
+filtered = result.filter(
+    partition="electronics",
+    column="price",
+    datatype="numeric"
+)
+```
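Conceptually, partitioning groups rows by the value of the partition column and computes statistics per group. The sketch below shows that grouping with plain Python rows. The leading underscore in the partition names mirrors the listing above. How the SDK names partitions for several columns is not stated here, and `partition_rows` is a hypothetical helper.

```python
from collections import defaultdict


def partition_rows(rows, key):
    """Group row dictionaries by `row[key]`, naming each group '_<value>'.

    Hypothetical helper; the naming scheme is copied from the README listing.
    """
    partitions = defaultdict(list)
    for row in rows:
        partitions[f"_{row[key]}"].append(row)
    return dict(partitions)


# Invented sample rows.
rows = [
    {"category": "electronics", "price": 199.0},
    {"category": "clothing", "price": 35.0},
    {"category": "electronics", "price": 349.0},
    {"category": "food", "price": 4.5},
]

partitions = partition_rows(rows, "category")
for name, group in partitions.items():
    print(f"{name}: {len(group)} rows")
```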
+### Non-Partitioned Data
+For non-partitioned data, use `.stats` directly:
+```python
+result = sdk.inspect(df)  # No partition_by
+# Access statistics directly
+stats = result.stats
+print(stats.overall.height)
+print(stats.summary.numeric["price"])
+```
+---
+## Output Formats
+All statistics objects support multiple output formats:
+### Polars (Default)
+```python
+# DataFrames are Polars by default
+df = result.stats.summary.numeric["price"]
+print(type(df))  # <class 'polars.DataFrame'>
+# Explicitly convert to Polars
+polars_data = result.stats.to_polars()
+polars_lazy = result.stats.to_polars_lazy()  # LazyFrame for large data
+```
+### Pandas
+```python
+# Convert everything to pandas
+pandas_data = result.stats.to_pandas()
+# Convert specific statistics
+pandas_df = result.stats.summary.get_stats_for_column_pandas("price")
+pandas_counts = result.stats.summary.get_counts_for_column_pandas("status")
+# Correlation matrix with index
+corr_df = result.stats.correlation.to_pandas_indexed()
+```
+### NumPy
+```python
+# Convert to numpy arrays
+numpy_data = result.stats.to_numpy()
+# Correlation matrix as numpy array
+corr_array = result.stats.correlation.to_numpy()
+```
+### Dictionary
+```python
+# Convert to plain Python dictionaries
+dict_data = result.stats.to_dict()
+# Useful for JSON serialization
+import json
+json.dumps(result.stats.overall.to_dict())
+```
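When the dictionary contains values that `json` cannot encode natively, such as dates, passing `default=str` is one common workaround. The sketch below uses a stand-in dictionary. The exact keys and value types returned by `to_dict()` are not shown in this README, and `generated_on` is a hypothetical field added for the example.

```python
import json
from datetime import date

# Stand-in for an overall-statistics dictionary; keys are illustrative only.
overall = {
    "height": 1000,
    "width": 8,
    "null_count": 12,
    "column_names": ["price", "quantity"],
    "generated_on": date(2025, 1, 31),  # hypothetical field, not from the SDK
}

# default=str converts any value json cannot encode to its string form.
payload = json.dumps(overall, default=str, indent=2)
restored = json.loads(payload)

print(payload)
```

Values converted by `default=str` come back as strings after `json.loads`, so parse them again if you need the original type.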
+---
+## DataFrame Compatibility
+The SDK works with multiple DataFrame types:
+### Polars DataFrame
+```python
+import polars as pl
+df = pl.read_csv("data.csv")
+result = sdk.inspect(df)
+graphs = sdk.generate_graphs(df)
+```
+### Polars LazyFrame
+```python
+import polars as pl
+# LazyFrames are collected automatically
+lf = pl.scan_csv("data.csv")
+result = sdk.inspect(lf)
+graphs = sdk.generate_graphs(lf)
+```
+### Pandas DataFrame
+```python
+import pandas as pd
+df = pd.read_csv("data.csv")
+result = sdk.inspect(df)
+graphs = sdk.generate_graphs(df)
+```
+---
+## Browser Inspection
+Open an interactive inspection view in your browser:
+```python
+# Inspect and open browser
+result = sdk.inspect_in_browser(df)
+# With partitioning
+result = sdk.inspect_in_browser(df, partition_by=["category"])
+```
+---
+## Complete Example
+```python
+import polars as pl
+from csvpredict_sdk import CSVPredict
+# Initialize SDK
+sdk = CSVPredict(base_url="http://localhost:8000")
+# Load data
+df = pl.read_csv("sales_data.csv")
+# Generate and display graphs
+graphs = sdk.generate_graphs(df, extension=".svg")
+graphs.display(width=400, columns=2)
+# Filter to show only histograms
+graphs.display("histogram", width=300)
+# Save graphs to disk
+graphs.save("sales_graphs.zip")
+# Inspect data
+result = sdk.inspect(df)
+# Overall statistics
+print(f"Rows: {result.stats.overall.height}")
+print(f"Columns: {result.stats.overall.width}")
+print(f"Null values: {result.stats.overall.null_count}")
+# Numeric statistics for price column
+price_stats = result.stats.summary.numeric["price"]
+print(price_stats)
+# Get all price-related statistics
+price_data = result.stats.summary.get_stats_for_column("price")
+for col_name, stats_df in price_data.items():
+    print(f"\n{col_name}:")
+    print(stats_df)
+# Correlation analysis
+print("\nCorrelation Matrix:")
+print(result.stats.correlation.matrix)
+# Get specific correlation
+corr = result.stats.correlation.get_correlation("price", "quantity")
+print(f"\nPrice-Quantity correlation: {corr}")
+# Export to different formats
+pandas_stats = result.stats.to_pandas()
+numpy_stats = result.stats.to_numpy()
+dict_stats = result.stats.to_dict()
+# Partitioned analysis
+result = sdk.inspect(df, partition_by=["category"])
+for partition, stats in result.items():
+    print(f"\n{partition}: {stats.overall.height} rows")
+    print(stats.summary.numeric["price"])
+```
+---
+## Jupyter Notebook Tips
+### Optimal Display Settings
+```python
+# For high-DPI displays
+graphs.display(width=600, height=450, columns=2)
+# For presentations
+graphs.display(width=800, height=600, columns=1)
+# Quick overview
+graphs.display(width=300, height=200, columns=4)
+```
+### Filtering in Notebooks
+```python
+# Show only correlation graphs
+graphs.display("correlation")
+# Show only a specific column's graphs
+graphs.display("price")
+# Show histograms at smaller size
+graphs.display("histogram", width=250, columns=3)
+```
+---
+## Error Handling
+```python
+import polars as pl
+from csvpredict_sdk import CSVPredict
+sdk = CSVPredict()
+df = pl.read_csv("data.csv")
+try:
+    result = sdk.inspect(df)
+except ValueError as e:
+    print(f"API error: {e}")
+except Exception as e:
+    print(f"Unexpected error: {e}")
+```
+---
+## License
+MIT License - see LICENSE file for details.