schemalytics 0.1.0__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- schemalytics-0.1.0/PKG-INFO +420 -0
- schemalytics-0.1.0/README.md +401 -0
- schemalytics-0.1.0/mnt/user-data/outputs/schemalytics/schemalytics/extractors/__init__.py +4 -0
- schemalytics-0.1.0/mnt/user-data/outputs/schemalytics/schemalytics/generators/__init__.py +4 -0
- schemalytics-0.1.0/pyproject.toml +37 -0
- schemalytics-0.1.0/schemalytics/__init__.py +13 -0
- schemalytics-0.1.0/schemalytics/bin/daff.py +11189 -0
- schemalytics-0.1.0/schemalytics/cli.py +127 -0
- schemalytics-0.1.0/schemalytics/extractors/__init__.py +4 -0
- schemalytics-0.1.0/schemalytics/extractors/postgres.py +42 -0
- schemalytics-0.1.0/schemalytics/generators/__init__.py +4 -0
- schemalytics-0.1.0/schemalytics/generators/dbt.py +244 -0
- schemalytics-0.1.0/schemalytics/industry_taxonomy.py +381 -0
- schemalytics-0.1.0/schemalytics/llm.py +67 -0
- schemalytics-0.1.0/schemalytics/models.py +78 -0
- schemalytics-0.1.0/schemalytics/planner.py +760 -0
- schemalytics-0.1.0/schemalytics/templates.py +599 -0
- schemalytics-0.1.0/schemalytics.egg-info/PKG-INFO +420 -0
- schemalytics-0.1.0/schemalytics.egg-info/SOURCES.txt +22 -0
- schemalytics-0.1.0/schemalytics.egg-info/dependency_links.txt +1 -0
- schemalytics-0.1.0/schemalytics.egg-info/entry_points.txt +2 -0
- schemalytics-0.1.0/schemalytics.egg-info/requires.txt +12 -0
- schemalytics-0.1.0/schemalytics.egg-info/top_level.txt +5 -0
- schemalytics-0.1.0/setup.cfg +4 -0
@@ -0,0 +1,420 @@
Metadata-Version: 2.4
Name: schemalytics
Version: 0.1.0
Summary: Automated dbt project generation from PostgreSQL schemas with semantic layer
License: Apache-2.0
Requires-Python: >=3.10
Description-Content-Type: text/markdown
Requires-Dist: click>=8.0
Requires-Dist: pydantic>=2.0
Requires-Dist: sqlalchemy>=2.0
Requires-Dist: psycopg2-binary>=2.9
Requires-Dist: httpx>=0.24
Requires-Dist: jinja2>=3.0
Requires-Dist: pyyaml>=6.0
Provides-Extra: dev
Requires-Dist: pytest>=7.0; extra == "dev"
Requires-Dist: pytest-cov>=4.0; extra == "dev"
Requires-Dist: ruff>=0.1; extra == "dev"

# Schemalytics

**Automated dbt project generation from PostgreSQL schemas with a comprehensive semantic layer for LLM-powered analytics.**

Schemalytics extracts your PostgreSQL database schema, classifies tables as facts or dimensions using AI-assisted analysis, and generates a production-ready dbt project with a medallion architecture (Bronze → Silver → Gold). It also creates a detailed semantic layer that lets LLMs understand your data model and generate accurate SQL for self-service analytics.

## Key Features

✨ **Automated Data Modeling** - Extracts schemas and generates dbt projects automatically
🏗️ **Medallion Architecture** - Bronze (raw) → Silver (dimensional) → Gold (aggregated)
🤖 **AI-Enhanced Classification** - Uses local LLMs (Ollama) to validate table classifications
🔒 **Privacy-First** - All processing happens locally; no data leaves your machine
📊 **Comprehensive Semantic Layer** - 500+ lines of LLM-ready metadata for accurate queries
📝 **Complete dbt Documentation** - Auto-generated schema.yml files with tests and descriptions
🎯 **Industry Templates** - 50+ pre-configured industry patterns (E-commerce, SaaS, Finance, etc.)

## Installation

### Prerequisites

- **Python 3.10+**
- **PostgreSQL database** (with data to model)
- **Ollama** with local models for intelligent classification

### 1. Install Ollama & Models

```bash
# Install Ollama (https://ollama.ai)
# macOS/Linux:
curl -fsSL https://ollama.com/install.sh | sh

# Pull required models
ollama pull qwen-data:latest
ollama pull qwen2.5-coder:7b  # Fallback model
```

### 2. Install Schemalytics

```bash
# Clone the repository
git clone https://github.com/yourusername/schemalytics.git
cd schemalytics

# Install in development mode
pip install -e .
```

### 3. Verify Installation

```bash
# Check that schemalytics is installed
schemalytics --version

# Verify Ollama is running
ollama list
```

## Quick Start

### One-Command Generation (Interactive)

```bash
schemalytics generate \
  --connection postgresql://user:password@localhost:5432/mydb \
  --output ./my_dbt_project \
  --name my_project
```

**You'll be prompted for:**
1. **Industry** - Select from 14 main industries (E-commerce, SaaS, Finance, etc.)
2. **Sub-industry** - Choose a specific business type (B2C, B2B, Marketplace, etc.)
3. **Entities** - Review and edit suggested entities (customers, orders, products)
4. **Goals** - Review and edit analytical goals (revenue_reporting, customer_ltv)
5. **Temporal Tracking** - Choose SCD type (snapshot, historical, or both)
6. **Time Grains** - Select Gold layer aggregations (daily, weekly, monthly, yearly)

The tool will:
- Extract your database schema (e.g., 14 tables in ~2 seconds)
- Classify tables as facts or dimensions using AI
- Generate Gold layer aggregates based on your selections
- Create a complete dbt project with semantic layer

### Using Pre-Created Context

```bash
# Create context.yaml
cat > context.yaml << EOF
business_type: ecommerce_retail_b2c
entities: [customers, orders, products, order_items]
goals: [revenue_reporting, customer_lifetime_value, inventory_tracking]
temporal: historical
grain: daily,weekly,monthly
EOF

# Generate with context file
schemalytics generate \
  --connection postgresql://user:password@localhost:5432/mydb \
  --context context.yaml \
  --output ./my_dbt_project
```

### Step-by-Step Workflow

```bash
# 1. Extract schema
schemalytics extract \
  --connection postgresql://user:password@localhost:5432/mydb \
  --output schema.json

# 2. Generate modeling plan (with AI validation)
schemalytics plan \
  --schema schema.json \
  --context context.yaml \
  --output plan.yaml

# 3. Build dbt project
schemalytics build \
  --schema schema.json \
  --plan plan.yaml \
  --context context.yaml \
  --output ./dbt_project
```

## Generated Project Structure

```
dbt_project/
├── dbt_project.yml          # dbt project configuration
├── semantic_layer.yml       # Comprehensive LLM-ready metadata (500+ lines)
├── models/
│   ├── sources.yml          # Source definitions
│   ├── bronze/              # Raw passthrough (views)
│   │   ├── schema.yml       # Bronze documentation
│   │   └── bronze_*.sql
│   ├── silver/
│   │   ├── dimensions/      # SCD1/SCD2 dimensions
│   │   │   ├── schema.yml   # Dimension documentation with tests
│   │   │   └── dim_*.sql
│   │   └── facts/           # Fact tables
│   │       ├── schema.yml   # Fact documentation with tests
│   │       └── fct_*.sql
│   └── gold/                # Pre-aggregated metrics
│       ├── schema.yml       # Metrics documentation
│       └── gold_*.sql
├── tests/
├── macros/
└── README.md
```

## Semantic Layer for LLM Analytics

The generated `semantic_layer.yml` provides comprehensive metadata, including:

### Metrics Catalog
- SQL formulas and aggregation types
- Use cases and example queries
- Data types and null-handling rules
- Common filters and time ranges

### Dimensional Model
- Complete fact and dimension documentation
- Grain definitions and relationships
- Join patterns and cardinality
- SCD type information

### Query Guidelines
- Query strategy (Gold → Silver → Bronze)
- Performance optimization tips
- Common mistakes and solutions
- Date and null handling rules

### Query Library
- Pre-built analytical queries
- Period-over-period comparisons
- Cross-fact analysis patterns

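A hedged sketch of the kind of entries `semantic_layer.yml` contains — the exact keys and names below are illustrative, not the tool's actual output:

```yaml
metrics:
  daily_revenue:
    model: gold_daily_revenue
    formula: SUM(order_total)
    aggregation: sum
    grain: daily
    description: Total order revenue per day
dimensions:
  dim_customers:
    scd_type: 2
    grain: one row per customer per validity window
joins:
  - from: fct_orders.customer_key
    to: dim_customers.customer_key
    cardinality: many_to_one
```

An LLM given this file can resolve "revenue last week" to the right model, formula, and join path instead of guessing column names.
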
### Example LLM Usage

An LLM can read `semantic_layer.yml` to understand:
- Available metrics (daily_revenue, monthly_sales, customer_ltv)
- Time grains (daily, weekly, monthly, yearly)
- Dimension relationships and join paths
- Pre-calculated aggregations in the Gold layer

Then generate accurate SQL:
```sql
-- The LLM knows to query the Gold layer first
SELECT
    daily_date,
    total_revenue,
    order_count,
    avg_order_value
FROM gold_daily_revenue
WHERE daily_date >= CURRENT_DATE - 30
ORDER BY daily_date DESC
```

## How It Works

### 1. Schema Extraction
SQLAlchemy inspects your PostgreSQL database and extracts:
- Tables, columns, and data types
- Primary keys and foreign keys
- Relationships and constraints

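Schemalytics uses SQLAlchemy's inspector against PostgreSQL; the same introspection idea can be sketched with the stdlib `sqlite3` module so it runs anywhere (the tables here are hypothetical, and the real extractor lives in `schemalytics/extractors/postgres.py`):

```python
import sqlite3

# Build a tiny example database to introspect.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers(id),
        total NUMERIC
    );
""")

schema = {}
tables = [r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'")]
for table in tables:
    cols = conn.execute(f"PRAGMA table_info({table})").fetchall()
    fks = conn.execute(f"PRAGMA foreign_key_list({table})").fetchall()
    schema[table] = {
        "columns": [(c[1], c[2]) for c in cols],         # (name, type)
        "primary_key": [c[1] for c in cols if c[5]],     # pk flag set
        "foreign_keys": [(fk[3], fk[2]) for fk in fks],  # (from_col, ref_table)
    }

print(sorted(schema))
print(schema["orders"]["foreign_keys"])  # → [('customer_id', 'customers')]
```

The resulting table/column/FK dictionary is what the classification step below consumes.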
### 2. Intelligent Classification
**Heuristic Analysis:**
- FK graph analysis identifies patterns
- Tables with many outgoing FKs → Facts
- Tables with many incoming FKs → Dimensions

**AI Validation:**
- A local LLM (Ollama) validates the classifications
- Provides reasoning for each decision
- Suggests corrections for ambiguous cases

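The FK-degree heuristic above can be sketched in a few lines — a simplified stand-in for schemalytics' planner, not its actual code:

```python
from collections import Counter

def classify(fks: list[tuple[str, str]]) -> dict[str, str]:
    """Classify tables from (source_table, referenced_table) FK pairs."""
    out_degree = Counter(src for src, _ in fks)  # FKs a table points outward
    in_degree = Counter(dst for _, dst in fks)   # FKs pointing at a table
    tables = set(out_degree) | set(in_degree)
    # More outgoing than incoming references suggests a fact table;
    # otherwise treat it as a dimension. Ties are the ambiguous cases
    # the LLM validation step is meant to resolve.
    return {t: "fact" if out_degree[t] > in_degree[t] else "dimension"
            for t in tables}

fks = [
    ("orders", "customers"),
    ("order_items", "orders"),
    ("order_items", "products"),
]
print(classify(fks)["order_items"])  # → fact
```

Note that `orders` here has one outgoing and one incoming FK — exactly the kind of tie the heuristic alone cannot settle.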
### 3. Interactive Review
- User reviews the proposed model
- Can edit table classifications
- Accepts or rejects the plan before generation

### 4. Gold Layer Generation
**AI-Powered Metrics:**
- The LLM suggests common aggregations based on your industry
- Generates metrics aligned with your analytical goals

**Heuristic Fallback:**
- Time-based aggregates (daily, weekly, monthly, yearly)
- Business-specific patterns (e-commerce, SaaS metrics)

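The heuristic fallback amounts to emitting one aggregate model per requested time grain. A sketch with hypothetical names — the real logic is in `schemalytics/generators/dbt.py` and may differ:

```python
# Map context-file grains to the DATE_TRUNC unit a Gold model would use.
GRAIN_TRUNC = {"daily": "day", "weekly": "week", "monthly": "month", "yearly": "year"}

def gold_models(metric: str, grains: str) -> list[str]:
    """grains is comma-separated, as in the context file: 'daily,weekly,monthly'."""
    return [f"gold_{g}_{metric}" for g in grains.split(",") if g in GRAIN_TRUNC]

print(gold_models("revenue", "daily,weekly,monthly"))
# → ['gold_daily_revenue', 'gold_weekly_revenue', 'gold_monthly_revenue']
```

This is why the generated project contains models like `gold_daily_revenue` for each grain you selected.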
### 5. Template-Based SQL Generation
- Jinja2 templates ensure syntactically correct SQL
- The LLM fills in parameters rather than writing SQL from scratch
- Produces consistent, production-ready code

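To illustrate the fill-in-the-parameters idea, here is a sketch using the stdlib `string.Template` instead of the project's actual Jinja2 templates (the template text and column names are hypothetical):

```python
from string import Template

# The template fixes the SQL shape; the model only supplies identifiers.
DAILY_AGG = Template("""\
SELECT
    DATE_TRUNC('day', $date_column) AS daily_date,
    SUM($amount_column) AS total_revenue,
    COUNT(*) AS order_count
FROM $source_table
GROUP BY 1
""")

sql = DAILY_AGG.substitute(
    date_column="order_date",
    amount_column="order_total",
    source_table="fct_orders",
)
print(sql)
```

Because the structure is fixed, a bad parameter produces an obviously wrong identifier rather than malformed SQL.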
## Industry Support

### Available Industries (14 Main Categories)

1. **E-commerce & Retail** - B2C, B2B, Marketplace, Subscription
2. **SaaS & Software** - B2B, B2C, Platform, Collaboration
3. **Finance & Fintech** - Banking, Payments, Lending, Investment, Crypto, Insurance
4. **Healthcare** - Provider, Telehealth, Pharmacy, Health Apps
5. **Media & Entertainment** - Streaming, Gaming, Social Media, Publishing
6. **Marketing & Advertising** - Automation, Ad Networks, Email, Influencer
7. **Education** - K-12, Higher Ed, Online Courses, Corporate Training
8. **Logistics & Transportation** - Shipping, Warehouse, Rideshare, Delivery
9. **Hospitality & Travel** - Hotels, Booking, Restaurants, Vacation Rentals
10. **Real Estate** - Residential, Commercial, Property Management
11. **Manufacturing** - Production, Supply Chain, Inventory
12. **Human Resources** - HRIS, Recruiting, Payroll, Talent Marketplaces
13. **Nonprofit & Government** - Fundraising, Public Services
14. **Other/Custom** - Generic business patterns

Each industry includes:
- Pre-configured entities
- Analytical goals
- Common metrics
- Best practices

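One industry entry might look roughly like this — a hypothetical shape; the actual structure lives in `schemalytics/industry_taxonomy.py` and may differ:

```python
# Hypothetical shape of one industry template entry.
ECOMMERCE_B2C = {
    "business_type": "ecommerce_retail_b2c",
    "entities": ["customers", "orders", "products", "order_items"],
    "goals": ["revenue_reporting", "customer_lifetime_value", "inventory_tracking"],
    "metrics": ["daily_revenue", "avg_order_value", "repeat_purchase_rate"],
}

# The interactive prompts pre-fill entities and goals from an entry like
# this, then let you edit them before generation.
print(ECOMMERCE_B2C["entities"])
```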
## Architecture Decisions

### Why Local LLM?
- **Privacy** - No data sent to external APIs
- **Cost** - Zero API fees
- **Control** - Works offline, no rate limits
- **Speed** - Optimized for consumer hardware (e.g., an 8GB RAM MacBook)

### Why Template-Based SQL?
- **Reliability** - Guarantees syntactically correct SQL
- **Consistency** - Follows dbt best practices
- **Maintainability** - Easy to update and extend
- **Quality** - Production-tested patterns

### Why Gold + Semantic Layer?
- **Performance** - Pre-aggregated metrics (often 10-100x faster than querying raw tables)
- **LLM-Ready** - Structured metadata for accurate queries
- **Self-Service** - Enables non-technical analytics
- **Scalability** - Reduces query complexity

## Configuration

### Connection String Format

```bash
postgresql://username:password@hostname:port/database

# Examples:
postgresql://postgres:mypassword@localhost:5432/mydb
postgresql://user@localhost/mydb          # No password
postgresql://user:pass@remote.host:5432/db
```

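If a connection fails, you can sanity-check the string's parts with nothing but the Python stdlib — a debugging aid, not part of schemalytics:

```python
from urllib.parse import urlsplit

url = urlsplit("postgresql://user:password@localhost:5432/mydb")
# Each component should match what you expect for your server.
print(url.scheme)                  # postgresql
print(url.username, url.hostname)  # user localhost
print(url.port)                    # 5432
print(url.path.lstrip("/"))        # mydb
```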
### Context File Options

```yaml
business_type: ecommerce_retail_b2c  # industry_subindustry format
entities:
  - customers
  - orders
  - products
goals:
  - revenue_reporting
  - customer_lifetime_value
  - inventory_tracking
temporal: historical                 # snapshot | historical | both
grain: daily,weekly,monthly          # Comma-separated time grains
```

## Development

```bash
# Install in editable mode with dev dependencies
pip install -e ".[dev]"

# Run tests
pytest

# Format and lint (ruff is the linter included in the dev extras)
ruff format .
ruff check .
```

## Troubleshooting

### Ollama Not Running
```bash
# Check whether Ollama is responding
curl http://localhost:11434/api/tags

# Start Ollama
ollama serve
```

### Database Connection Issues
```bash
# Test the connection with psql
psql postgresql://user:pass@localhost:5432/mydb

# List databases to confirm yours exists
psql -U postgres -l
```

### Model Not Found
```bash
# List available models
ollama list

# Pull the required model
ollama pull qwen-data:latest
```

### Timeout Issues
The default LLM timeout is 15 minutes. For large schemas, adjust it in `llm.py`:
```python
LLM_TIMEOUT = 900.0  # seconds; increase if needed
```

## Roadmap

- [ ] Support additional databases (Snowflake, BigQuery, DuckDB)
- [ ] Web UI for interactive modeling
- [ ] Advanced SCD types (Type 3, Type 6)
- [ ] Data profiling and quality checks
- [ ] Custom business logic templates
- [ ] dbt Cloud integration
- [ ] Incremental model generation
- [ ] Multi-tenant support

## Contributing

Contributions welcome! Please:
1. Fork the repository
2. Create a feature branch
3. Add tests for new functionality
4. Submit a pull request

## License

Apache-2.0 - see the LICENSE file for details

## Support

- **Issues**: [GitHub Issues](https://github.com/yourusername/schemalytics/issues)
- **Documentation**: See `semantic_layer.yml` in generated projects
- **Examples**: Check the `examples/` directory

## Credits

Built with:
- [SQLAlchemy](https://www.sqlalchemy.org/) - Database inspection
- [Ollama](https://ollama.ai/) - Local LLM inference
- [Jinja2](https://jinja.palletsprojects.com/) - SQL templating
- [Click](https://click.palletsprojects.com/) - CLI framework
- [Pydantic](https://docs.pydantic.dev/) - Data validation

---

**Schemalytics** - Transform your database into an LLM-ready analytics platform in minutes.