schemalytics 0.1.0__tar.gz

Metadata-Version: 2.4
Name: schemalytics
Version: 0.1.0
Summary: Automated dbt project generation from PostgreSQL schemas with semantic layer
License: Apache-2.0
Requires-Python: >=3.10
Description-Content-Type: text/markdown
Requires-Dist: click>=8.0
Requires-Dist: pydantic>=2.0
Requires-Dist: sqlalchemy>=2.0
Requires-Dist: psycopg2-binary>=2.9
Requires-Dist: httpx>=0.24
Requires-Dist: jinja2>=3.0
Requires-Dist: pyyaml>=6.0
Provides-Extra: dev
Requires-Dist: pytest>=7.0; extra == "dev"
Requires-Dist: pytest-cov>=4.0; extra == "dev"
Requires-Dist: ruff>=0.1; extra == "dev"

# Schemalytics

**Automated dbt project generation from PostgreSQL schemas, with a comprehensive semantic layer for LLM-powered analytics.**

Schemalytics extracts your PostgreSQL database schema, classifies tables as facts or dimensions using AI-assisted analysis, and generates a production-ready dbt project following a medallion architecture (Bronze → Silver → Gold). The tool also creates a detailed semantic layer that enables LLMs to understand your data model and generate accurate SQL queries for self-service analytics.

## Key Features

✨ **Automated Data Modeling** - Extracts schemas and generates dbt projects automatically
🏗️ **Medallion Architecture** - Bronze (raw) → Silver (dimensional) → Gold (aggregated)
🤖 **AI-Enhanced Classification** - Uses local LLMs (Ollama) to validate table classifications
🔒 **Privacy-First** - All processing happens locally; no data leaves your machine
📊 **Comprehensive Semantic Layer** - 500+ lines of LLM-ready metadata for accurate queries
📝 **Complete dbt Documentation** - Auto-generated schema.yml files with tests and descriptions
🎯 **Industry Templates** - 50+ pre-configured industry patterns (E-commerce, SaaS, Finance, etc.)

## Installation

### Prerequisites

- **Python 3.10+**
- **PostgreSQL database** (with data to model)
- **Ollama** with AI models for intelligent classification

### 1. Install Ollama & Models

```bash
# Install Ollama (https://ollama.ai)
# macOS/Linux:
curl -fsSL https://ollama.com/install.sh | sh

# Pull required models
ollama pull qwen-data:latest
ollama pull qwen2.5-coder:7b  # Fallback model
```

### 2. Install Schemalytics

```bash
# Clone the repository
git clone https://github.com/yourusername/schemalytics.git
cd schemalytics

# Install in development mode
pip install -e .
```

### 3. Verify Installation

```bash
# Check if schemalytics is installed
schemalytics --version

# Verify Ollama is running
ollama list
```

## Quick Start

### One-Command Generation (Interactive)

```bash
schemalytics generate \
  --connection postgresql://user:password@localhost:5432/mydb \
  --output ./my_dbt_project \
  --name my_project
```

**You'll be prompted for:**
1. **Industry** - Select from 14 main industries (E-commerce, SaaS, Finance, etc.)
2. **Sub-industry** - Choose a specific business type (B2C, B2B, Marketplace, etc.)
3. **Entities** - Review and edit suggested entities (customers, orders, products)
4. **Goals** - Review and edit analytical goals (revenue_reporting, customer_ltv)
5. **Temporal Tracking** - Choose SCD type (snapshot, historical, or both)
6. **Time Grains** - Select Gold layer aggregations (daily, weekly, monthly, yearly)

The tool will:
- Extract your database schema (e.g., 14 tables in about 2 seconds)
- Classify tables as facts or dimensions using AI (e.g., 5 facts, 9 dimensions)
- Generate Gold layer aggregates based on your selections
- Create a complete dbt project with semantic layer

### Using Pre-Created Context

```bash
# Create context.yaml
cat > context.yaml << EOF
business_type: ecommerce_retail_b2c
entities: [customers, orders, products, order_items]
goals: [revenue_reporting, customer_lifetime_value, inventory_tracking]
temporal: historical
grain: daily,weekly,monthly
EOF

# Generate with context file
schemalytics generate \
  --connection postgresql://user:password@localhost:5432/mydb \
  --context context.yaml \
  --output ./my_dbt_project
```

### Step-by-Step Workflow

```bash
# 1. Extract schema
schemalytics extract \
  --connection postgresql://user:password@localhost:5432/mydb \
  --output schema.json

# 2. Generate modeling plan (with AI validation)
schemalytics plan \
  --schema schema.json \
  --context context.yaml \
  --output plan.yaml

# 3. Build dbt project
schemalytics build \
  --schema schema.json \
  --plan plan.yaml \
  --context context.yaml \
  --output ./dbt_project
```

## Generated Project Structure

```
dbt_project/
├── dbt_project.yml          # dbt project configuration
├── semantic_layer.yml       # Comprehensive LLM-ready metadata (500+ lines)
├── models/
│   ├── sources.yml          # Source definitions
│   ├── bronze/              # Raw passthrough (views)
│   │   ├── schema.yml       # Bronze documentation
│   │   └── bronze_*.sql
│   ├── silver/
│   │   ├── dimensions/      # SCD1/SCD2 dimensions
│   │   │   ├── schema.yml   # Dimension documentation with tests
│   │   │   └── dim_*.sql
│   │   └── facts/           # Fact tables
│   │       ├── schema.yml   # Fact documentation with tests
│   │       └── fct_*.sql
│   └── gold/                # Pre-aggregated metrics
│       ├── schema.yml       # Metrics documentation
│       └── gold_*.sql
├── tests/
├── macros/
└── README.md
```

## Semantic Layer for LLM Analytics

The generated `semantic_layer.yml` provides comprehensive metadata, including:

### Metrics Catalog
- SQL formulas and aggregation types
- Use cases and example queries
- Data types and null handling rules
- Common filters and time ranges

### Dimensional Model
- Complete fact and dimension documentation
- Grain definitions and relationships
- Join patterns and cardinality
- SCD type information

### Query Guidelines
- Query strategy (Gold → Silver → Bronze)
- Performance optimization tips
- Common mistakes and solutions
- Date and null handling rules

### Query Library
- Pre-built analytical queries
- Period-over-period comparisons
- Cross-fact analysis patterns

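The Gold-first query strategy can be sketched as a thin routing layer over the metrics catalog. The dictionary below is a hypothetical slice of what a parsed `semantic_layer.yml` might contain; the real file's structure and keys may differ.

```python
# Illustrative sketch: resolve a metric name through semantic-layer
# metadata and route the query to its pre-aggregated Gold model.
# The metadata shape here is an assumption, not the tool's actual schema.
SEMANTIC_LAYER = {
    "metrics": {
        "daily_revenue": {
            "table": "gold_daily_revenue",  # pre-aggregated Gold model
            "columns": ["daily_date", "total_revenue", "order_count"],
            "grain": "daily",
        },
    },
}

def build_query(metric: str, last_n_days: int = 30) -> str:
    """Look up a metric and emit a simple SELECT against its Gold table."""
    meta = SEMANTIC_LAYER["metrics"][metric]
    cols = ", ".join(meta["columns"])
    return (
        f"SELECT {cols} FROM {meta['table']} "
        f"WHERE daily_date >= CURRENT_DATE - {last_n_days} "
        f"ORDER BY daily_date DESC"
    )

print(build_query("daily_revenue"))
```

Because the metric resolves directly to a Gold table, the generated SQL never has to join or aggregate Silver facts at query time.
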
### Example LLM Usage

An LLM can read `semantic_layer.yml` to understand:
- Available metrics (daily_revenue, monthly_sales, customer_ltv)
- Time grains (daily, weekly, monthly, yearly)
- Dimension relationships and join paths
- Pre-calculated aggregations in the Gold layer

Then generate accurate SQL:
```sql
-- LLM understands to query the Gold layer first
SELECT
    daily_date,
    total_revenue,
    order_count,
    avg_order_value
FROM gold_daily_revenue
WHERE daily_date >= CURRENT_DATE - 30
ORDER BY daily_date DESC
```

## How It Works

### 1. Schema Extraction
SQLAlchemy inspects your PostgreSQL database and extracts:
- Tables, columns, and data types
- Primary keys and foreign keys
- Relationships and constraints

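A minimal sketch of this extraction step with SQLAlchemy's inspection API. An in-memory SQLite database stands in for PostgreSQL so the example is self-contained; against Postgres you would pass the `postgresql://` URL instead, and the table definitions here are purely illustrative.

```python
# Sketch: extract tables, columns, PKs, and FKs via SQLAlchemy's Inspector.
from sqlalchemy import create_engine, inspect, text

engine = create_engine("sqlite:///:memory:")  # stand-in for postgresql://...
with engine.begin() as conn:
    conn.execute(text("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)"))
    conn.execute(text(
        "CREATE TABLE orders (id INTEGER PRIMARY KEY, "
        "customer_id INTEGER REFERENCES customers(id), total NUMERIC)"
    ))

insp = inspect(engine)
schema = {
    table: {
        "columns": [c["name"] for c in insp.get_columns(table)],
        "pk": insp.get_pk_constraint(table)["constrained_columns"],
        "fks": [fk["referred_table"] for fk in insp.get_foreign_keys(table)],
    }
    for table in insp.get_table_names()
}
print(schema)
```

The resulting dictionary (tables, columns, keys, and FK targets) is all the classification step below needs.
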
### 2. Intelligent Classification
**Heuristic Analysis:**
- FK graph analysis identifies patterns
- Tables with many outgoing FKs → Facts
- Tables with many incoming FKs → Dimensions

**AI Validation:**
- A local LLM (Ollama) validates classifications
- Provides reasoning for each decision
- Suggests corrections for ambiguous cases

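The FK-graph heuristic fits in a few lines. The table names and the outgoing-FK threshold below are illustrative assumptions, not Schemalytics' actual values.

```python
# Sketch of the fact/dimension heuristic: tables referencing several
# others look like facts; tables referenced by others look like dimensions.
from collections import Counter

# table -> tables it references via foreign keys (illustrative)
fk_graph = {
    "orders": ["customers", "products"],
    "order_items": ["orders", "products"],
    "customers": [],
    "products": [],
}

# count incoming references per table
incoming = Counter(ref for refs in fk_graph.values() for ref in refs)

def classify(table: str, min_outgoing: int = 2) -> str:
    if len(fk_graph[table]) >= min_outgoing:
        return "fact"        # points at several other tables
    if incoming[table] > 0:
        return "dimension"   # other tables point at it
    return "unknown"         # ambiguous; defer to LLM validation

print({t: classify(t) for t in fk_graph})
```

Ambiguous tables fall through to `"unknown"`, which is exactly where the LLM validation pass earns its keep.
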
### 3. Interactive Review
- User reviews the proposed model
- Can edit table classifications
- Accepts or rejects the plan before generation

### 4. Gold Layer Generation
**AI-Powered Metrics:**
- LLM suggests common aggregations based on industry
- Generates metrics aligned with analytical goals

**Heuristic Fallback:**
- Time-based aggregates (daily, weekly, monthly, yearly)
- Business-specific patterns (e-commerce, SaaS metrics)

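The heuristic fallback can be sketched as one time-grained aggregate per requested grain. The fact table, column names, and SQL shape below are illustrative; the real tool renders its SQL through Jinja2 templates.

```python
# Sketch: generate a date_trunc-based revenue aggregate per time grain.
GRAIN_TRUNC = {"daily": "day", "weekly": "week", "monthly": "month", "yearly": "year"}

def gold_aggregate_sql(fact_table: str, amount_col: str, date_col: str, grain: str) -> str:
    trunc = GRAIN_TRUNC[grain]
    return (
        f"SELECT date_trunc('{trunc}', {date_col}) AS {grain}_date, "
        f"SUM({amount_col}) AS total_revenue, COUNT(*) AS order_count "
        f"FROM {fact_table} GROUP BY 1"
    )

for g in ["daily", "weekly", "monthly"]:
    print(gold_aggregate_sql("fct_orders", "order_total", "order_date", g))
```
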
### 5. Template-Based SQL Generation
- Jinja2 templates ensure syntactically correct SQL
- The LLM fills in parameters rather than writing SQL from scratch
- Produces production-ready code built on tested patterns

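A minimal sketch of the template-based approach: the template fixes the SQL structure and the model only supplies parameters. The template text and parameter names here are illustrative, not the tool's actual templates.

```python
# Sketch: a Jinja2 template owns the SQL shape; only parameters vary.
from jinja2 import Template

TEMPLATE = Template(
    "SELECT {{ keys | join(', ') }},\n"
    "       SUM({{ measure }}) AS total_{{ measure }}\n"
    "FROM {{ source_table }}\n"
    "GROUP BY {{ keys | join(', ') }}"
)

# Parameters an LLM (or heuristic) might fill in:
sql = TEMPLATE.render(
    keys=["customer_id"],
    measure="order_total",
    source_table="bronze_orders",
)
print(sql)
```

Because the model can only choose parameter values, a malformed template rendering is far less likely than malformed free-form SQL.
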
## Industry Support

### Available Industries (14 Main Categories)

1. **E-commerce & Retail** - B2C, B2B, Marketplace, Subscription
2. **SaaS & Software** - B2B, B2C, Platform, Collaboration
3. **Finance & Fintech** - Banking, Payments, Lending, Investment, Crypto, Insurance
4. **Healthcare** - Provider, Telehealth, Pharmacy, Health Apps
5. **Media & Entertainment** - Streaming, Gaming, Social Media, Publishing
6. **Marketing & Advertising** - Automation, Ad Networks, Email, Influencer
7. **Education** - K-12, Higher Ed, Online Courses, Corporate Training
8. **Logistics & Transportation** - Shipping, Warehouse, Rideshare, Delivery
9. **Hospitality & Travel** - Hotels, Booking, Restaurants, Vacation Rentals
10. **Real Estate** - Residential, Commercial, Property Management
11. **Manufacturing** - Production, Supply Chain, Inventory
12. **Human Resources** - HRIS, Recruiting, Payroll, Talent Marketplaces
13. **Nonprofit & Government** - Fundraising, Public Services
14. **Other/Custom** - Generic business patterns

Each industry includes:
- Pre-configured entities
- Analytical goals
- Common metrics
- Best practices

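A hypothetical shape for such an industry template, as a sketch. The entity, goal, and metric names follow the examples elsewhere in this README, but the real template files may be structured differently.

```python
# Illustrative industry templates: per-industry defaults for entities,
# goals, and metrics. Keys and values are assumptions for this sketch.
INDUSTRY_TEMPLATES = {
    "ecommerce_retail_b2c": {
        "entities": ["customers", "orders", "products", "order_items"],
        "goals": ["revenue_reporting", "customer_lifetime_value", "inventory_tracking"],
        "metrics": ["daily_revenue", "avg_order_value", "repeat_purchase_rate"],
    },
    "saas_b2b": {
        "entities": ["accounts", "subscriptions", "invoices", "usage_events"],
        "goals": ["mrr_reporting", "churn_analysis"],
        "metrics": ["mrr", "churn_rate", "arpu"],
    },
}

def suggest_defaults(business_type: str) -> dict:
    """Fall back to an empty generic template for unknown business types."""
    generic = {"entities": [], "goals": [], "metrics": []}
    return INDUSTRY_TEMPLATES.get(business_type, generic)

print(suggest_defaults("ecommerce_retail_b2c")["goals"])
```
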
## Architecture Decisions

### Why Local LLM?
- **Privacy** - No data sent to external APIs
- **Cost** - Zero API fees
- **Control** - Works offline, no rate limits
- **Speed** - Optimized for consumer hardware (8GB RAM MacBook)

### Why Template-Based SQL?
- **Reliability** - Guarantees syntactically correct SQL
- **Consistency** - Follows dbt best practices
- **Maintainability** - Easy to update and extend
- **Quality** - Production-tested patterns

### Why Gold + Semantic Layer?
- **Performance** - Pre-aggregated metrics (10-100x faster)
- **LLM-Ready** - Structured metadata for accurate queries
- **Self-Service** - Enables non-technical analytics
- **Scalability** - Reduces query complexity

## Configuration

### Connection String Format

```bash
postgresql://username:password@hostname:port/database

# Examples:
postgresql://postgres:mypassword@localhost:5432/mydb
postgresql://user@localhost/mydb            # No password
postgresql://user:pass@remote.host:5432/db
```

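For readers unsure which segment is which, the URL parts can be pulled apart with the standard library. Schemalytics hands the URL straight to SQLAlchemy, so this is purely to show what each segment means.

```python
# Sketch: decompose a connection URL with urllib.parse.
from urllib.parse import urlsplit

url = urlsplit("postgresql://user:secret@localhost:5432/mydb")
parts = {
    "user": url.username,
    "password": url.password,
    "host": url.hostname,
    "port": url.port,
    "database": url.path.lstrip("/"),  # path carries the database name
}
print(parts)
```
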
### Context File Options

```yaml
business_type: ecommerce_retail_b2c   # industry_subindustry format
entities:
  - customers
  - orders
  - products
goals:
  - revenue_reporting
  - customer_lifetime_value
  - inventory_tracking
temporal: historical                  # snapshot | historical | both
grain: daily,weekly,monthly           # Comma-separated time grains
```

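A sketch of how a parsed context file might be validated. The real tool uses Pydantic models; a dataclass stands in here to keep the example dependency-free, and the field names simply mirror the YAML keys above.

```python
# Illustrative validation of context-file fields (dataclass stand-in
# for the tool's Pydantic models; allowed values taken from the README).
from dataclasses import dataclass, field

VALID_TEMPORAL = {"snapshot", "historical", "both"}
VALID_GRAINS = {"daily", "weekly", "monthly", "yearly"}

@dataclass
class Context:
    business_type: str
    entities: list = field(default_factory=list)
    goals: list = field(default_factory=list)
    temporal: str = "snapshot"
    grain: str = "daily"

    @property
    def grains(self) -> list:
        return [g.strip() for g in self.grain.split(",")]

    def __post_init__(self):
        if self.temporal not in VALID_TEMPORAL:
            raise ValueError(f"temporal must be one of {VALID_TEMPORAL}")
        bad = set(self.grains) - VALID_GRAINS
        if bad:
            raise ValueError(f"unknown grains: {bad}")

ctx = Context(
    business_type="ecommerce_retail_b2c",
    entities=["customers", "orders"],
    goals=["revenue_reporting"],
    temporal="historical",
    grain="daily,weekly,monthly",
)
print(ctx.grains)
```
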
## Development

```bash
# Install in editable mode with dev dependencies
pip install -e ".[dev]"

# Run tests
pytest

# Format code
ruff format .

# Type checking
mypy schemalytics
```

## Troubleshooting

### Ollama Not Running
```bash
# Check if Ollama is running
curl http://localhost:11434/api/tags

# Start Ollama
ollama serve
```

### Database Connection Issues
```bash
# Test connection with psql
psql postgresql://user:pass@localhost:5432/mydb

# Check if database exists
psql -U postgres -l
```

### Model Not Found
```bash
# List available models
ollama list

# Pull required model
ollama pull qwen-data:latest
```

### Timeout Issues
The default LLM timeout is 15 minutes. For large schemas, this may need adjustment in `llm.py`:
```python
LLM_TIMEOUT = 900.0  # seconds; increase if needed
```

## Roadmap

- [ ] Support additional databases (Snowflake, BigQuery, DuckDB)
- [ ] Web UI for interactive modeling
- [ ] Advanced SCD types (Type 3, Type 6)
- [ ] Data profiling and quality checks
- [ ] Custom business logic templates
- [ ] dbt Cloud integration
- [ ] Incremental model generation
- [ ] Multi-tenant support

## Contributing

Contributions welcome! Please:
1. Fork the repository
2. Create a feature branch
3. Add tests for new functionality
4. Submit a pull request

## License

Apache-2.0 License - see the LICENSE file for details

## Support

- **Issues**: [GitHub Issues](https://github.com/yourusername/schemalytics/issues)
- **Documentation**: See `semantic_layer.yml` in generated projects
- **Examples**: Check the `examples/` directory

## Credits

Built with:
- [SQLAlchemy](https://www.sqlalchemy.org/) - Database inspection
- [Ollama](https://ollama.ai/) - Local LLM inference
- [Jinja2](https://jinja.palletsprojects.com/) - SQL templating
- [Click](https://click.palletsprojects.com/) - CLI framework
- [Pydantic](https://docs.pydantic.dev/) - Data validation

---

**Schemalytics** - Transform your database into an LLM-ready analytics platform in minutes.