crushdataai 1.0.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (40)
  1. package/assets/antigravity/data-analyst.md +95 -0
  2. package/assets/claude/SKILL.md +145 -0
  3. package/assets/copilot/data-analyst.prompt.md +40 -0
  4. package/assets/cursor/data-analyst.md +50 -0
  5. package/assets/kiro/data-analyst.md +37 -0
  6. package/assets/shared/charts.csv +31 -0
  7. package/assets/shared/cleaning.csv +21 -0
  8. package/assets/shared/data/charts.csv +31 -0
  9. package/assets/shared/data/cleaning.csv +21 -0
  10. package/assets/shared/data/databases.csv +35 -0
  11. package/assets/shared/data/industries/ecommerce.csv +25 -0
  12. package/assets/shared/data/industries/finance.csv +24 -0
  13. package/assets/shared/data/industries/marketing.csv +25 -0
  14. package/assets/shared/data/industries/saas.csv +24 -0
  15. package/assets/shared/data/metrics.csv +74 -0
  16. package/assets/shared/data/python-patterns.csv +31 -0
  17. package/assets/shared/data/report-ux.csv +26 -0
  18. package/assets/shared/data/sql-patterns.csv +36 -0
  19. package/assets/shared/data/validation.csv +21 -0
  20. package/assets/shared/data/workflows.csv +51 -0
  21. package/assets/shared/databases.csv +35 -0
  22. package/assets/shared/industries/ecommerce.csv +25 -0
  23. package/assets/shared/industries/finance.csv +24 -0
  24. package/assets/shared/industries/marketing.csv +25 -0
  25. package/assets/shared/industries/saas.csv +24 -0
  26. package/assets/shared/metrics.csv +74 -0
  27. package/assets/shared/python-patterns.csv +31 -0
  28. package/assets/shared/report-ux.csv +26 -0
  29. package/assets/shared/scripts/__pycache__/core.cpython-311.pyc +0 -0
  30. package/assets/shared/scripts/core.py +238 -0
  31. package/assets/shared/scripts/search.py +61 -0
  32. package/assets/shared/sql-patterns.csv +36 -0
  33. package/assets/shared/validation.csv +21 -0
  34. package/assets/shared/workflows.csv +51 -0
  35. package/assets/windsurf/data-analyst.md +35 -0
  36. package/dist/commands.d.ts +3 -0
  37. package/dist/commands.js +159 -0
  38. package/dist/index.d.ts +2 -0
  39. package/dist/index.js +31 -0
  40. package/package.json +45 -0
@@ -0,0 +1,95 @@
+ ---
+ description: CrushData AI - Data Analyst workflow for structured analysis with validation
+ ---
+
+ # CrushData AI - Data Analyst Workflow
+
+ A data analyst intelligence workflow that guides you through structured, professional data analysis.
+
+ ## When to Use
+
+ Use this workflow when the user requests:
+ - Data analysis, EDA, ad-hoc queries
+ - Dashboard or report creation
+ - Metrics calculation (MRR, churn, conversion, etc.)
+ - Cohort, funnel, or A/B test analysis
+ - Data cleaning or profiling
+
+ ---
+
+ ## Step 1: Discovery Protocol (MANDATORY)
+
+ Before writing any code, ask the user:
+
+ 1. **Business Context**: What question should this analysis answer? Who is the audience?
+ 2. **Data Context**: Which tables contain the data? What time range?
+ 3. **Metric Definitions**: How does YOUR company define the key metrics? Any filters?
+
+ ---
+
+ ## Step 2: Search Knowledge Base
+
+ // turbo
+ ```bash
+ python3 .agent/workflows/../.shared/data-analyst/scripts/search.py "<query>" --domain <domain>
+ ```
+
+ **Domains:**
+ - `workflow` - Step-by-step analysis process
+ - `metric` - Metric definitions and formulas
+ - `chart` - Visualization recommendations
+ - `sql` - SQL patterns (window functions, cohorts)
+ - `python` - pandas/polars snippets
+ - `validation` - Common mistakes to avoid
+
+ **Industry search:**
+ // turbo
+ ```bash
+ python3 .shared/data-analyst/scripts/search.py "<query>" --industry saas|ecommerce|finance|marketing
+ ```
+
+ ---
+
+ ## Step 3: Data Profiling (MANDATORY)
+
+ Before analysis, run profiling and report to the user:
+
+ ```python
+ print(f"Shape: {df.shape}")
+ print(f"Date range: {df['date'].min()} to {df['date'].max()}")
+ print(f"Missing values:\n{df.isnull().sum()}")
+ ```
+
+ Ask: "I found X rows, Y users, dates from A to B. Does this match your expectation?"
+
+ ---
+
+ ## Step 4: Execute with Validation
+
+ **Before JOINs:** Run on 100 rows first. Ask if the row count change is expected.
+
+ **Before Aggregations:** Check for duplicates. Ask if totals seem reasonable.
+
+ **Before Delivery:** Compare to benchmarks. Present for user validation.
+
+ ---
+
+ ## Search Examples
+
+ | Analysis | Command |
+ |----------|---------|
+ | EDA workflow | `search.py "EDA" --domain workflow` |
+ | Churn metrics | `search.py "churn" --domain metric` |
+ | Cohort SQL | `search.py "cohort" --domain sql` |
+ | SaaS metrics | `search.py "MRR" --industry saas` |
+ | Chart selection | `search.py "time series" --domain chart` |
+
+ ---
+
+ ## Pre-Delivery Checklist
+
+ - [ ] Business question answered
+ - [ ] Data profiled and validated
+ - [ ] Metric definitions confirmed with the user
+ - [ ] Sanity checks passed
+ - [ ] Assumptions documented
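The Step 3 profiling protocol above is easy to wrap in a reusable helper. Below is a minimal sketch, assuming a pandas DataFrame with `date` and `user_id` columns; the helper name `profile_report` and the sample frame are ours for illustration, not part of the package.

```python
import pandas as pd

def profile_report(df: pd.DataFrame, date_col: str = "date", id_col: str = "user_id") -> str:
    """Summarize shape, date range, and missing values for user confirmation (Step 3)."""
    lines = [
        f"Shape: {df.shape[0]} rows x {df.shape[1]} columns",
        f"Unique {id_col}: {df[id_col].nunique()}",
        f"Date range: {df[date_col].min()} to {df[date_col].max()}",
        "Missing values:",
        df.isnull().sum().to_string(),
    ]
    return "\n".join(lines)

# Report before any aggregation, then confirm with the user.
df = pd.DataFrame({
    "user_id": [1, 1, 2],
    "date": pd.to_datetime(["2024-01-01", "2024-01-02", None]),
    "amount": [10.0, None, 5.0],
})
print(profile_report(df))
print("Does this match your expectation?")
```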
@@ -0,0 +1,145 @@
+ # CrushData AI - Data Analyst Skill
+
+ A data analyst intelligence skill that guides you through structured, professional data analysis workflows.
+
+ ---
+
+ ## How to Use This Skill
+
+ When the user requests data analysis work (analyze, query, dashboard, metrics, EDA, cohort, funnel, A/B test), follow this workflow:
+
+ ### Step 1: Discovery Protocol (MANDATORY)
+
+ **Before writing any code, ask the user:**
+
+ ```
+ ## Discovery Questions
+
+ 1. **Business Context**
+ - What business question should this analysis answer?
+ - Who is the audience? (Executive, Analyst, Engineer)
+ - What action will this analysis inform?
+
+ 2. **Data Context**
+ - Which tables/databases contain the relevant data?
+ - What time range should I analyze?
+ - Any known data quality issues?
+
+ 3. **Metric Definitions**
+ - How does YOUR company define the key metrics?
+ - Any filters to apply? (exclude test users, internal accounts?)
+ - What timezone should I use for dates?
+ ```
+
+ ### Step 2: Search Relevant Domains
+
+ Use `search.py` to gather comprehensive information:
+
+ ```bash
+ python3 .claude/skills/data-analyst/scripts/search.py "<query>" --domain <domain> [-n 3]
+ ```
+
+ **Available domains:**
+ | Domain | Use Case |
+ |--------|----------|
+ | `workflow` | Step-by-step analysis process |
+ | `metric` | Metric definitions and formulas |
+ | `chart` | Visualization recommendations |
+ | `cleaning` | Data quality patterns |
+ | `sql` | SQL patterns (window functions, cohorts) |
+ | `python` | pandas/polars code snippets |
+ | `database` | PostgreSQL, BigQuery, Snowflake tips |
+ | `report` | Dashboard UX guidelines |
+ | `validation` | Common mistakes to avoid |
+
+ **Industry-specific search:**
+ ```bash
+ python3 .claude/skills/data-analyst/scripts/search.py "<query>" --industry saas|ecommerce|finance|marketing
+ ```
+
+ **Recommended search order:**
+ 1. `workflow` - Get the step-by-step process for this analysis type
+ 2. `metric` or `--industry` - Get relevant metric definitions
+ 3. `sql` or `python` - Get code patterns for implementation
+ 4. `chart` - Get visualization recommendations
+ 5. `validation` - Check for common mistakes to avoid
+
+ ### Step 3: Data Profiling (MANDATORY Before Analysis)
+
+ Before any analysis, run profiling:
+
+ **Python:**
+ ```python
+ print(f"Shape: {df.shape}")
+ print(f"Date range: {df['date'].min()} to {df['date'].max()}")
+ print(f"Missing values:\n{df.isnull().sum()}")
+ print(f"Sample:\n{df.sample(5)}")
+ ```
+
+ **SQL:**
+ ```sql
+ SELECT
+   COUNT(*) as total_rows,
+   COUNT(DISTINCT user_id) as unique_users,
+   MIN(date) as min_date,
+   MAX(date) as max_date
+ FROM table;
+ ```
+
+ **Report findings to the user before proceeding:**
+ > "I found X rows, Y unique users, date range from A to B. Does this match your expectation?"
+
+ ### Step 4: Execute Analysis with Validation
+
+ **Before JOINs:**
+ - Run on 100 rows first
+ - Check: Did the row count change unexpectedly?
+ - Ask: "The join produced X rows from Y. Expected?"
+
+ **Before Aggregations:**
+ - Check for duplicates that could inflate sums
+ - Verify granularity: "Is this one row per user per day?"
+ - Ask: "Total = $X. Does this seem reasonable?"
+
+ **Before Delivery:**
+ - Sanity-check the order of magnitude
+ - Compare to a benchmark or prior period
+ - Present for user validation before finalizing
+
+ ---
+
+ ## Workflow Reference
+
+ | Analysis Type | Search Command |
+ |--------------|----------------|
+ | EDA | `search.py "exploratory data analysis" --domain workflow` |
+ | Dashboard | `search.py "dashboard creation" --domain workflow` |
+ | A/B Test | `search.py "ab test" --domain workflow` |
+ | Cohort | `search.py "cohort analysis" --domain workflow` |
+ | Funnel | `search.py "funnel analysis" --domain workflow` |
+ | Time Series | `search.py "time series" --domain workflow` |
+ | Segmentation | `search.py "customer segmentation" --domain workflow` |
+ | Data Cleaning | `search.py "data cleaning" --domain workflow` |
+
+ ---
+
+ ## Common Rules
+
+ 1. **Always ask before assuming** - Metric definitions vary by company
+ 2. **Profile data first** - Never aggregate without understanding the data
+ 3. **Validate results** - Check totals, compare to benchmarks
+ 4. **Document assumptions** - State what filters and definitions you used
+ 5. **Show your work** - Explain the logic behind complex queries
+
+ ---
+
+ ## Pre-Delivery Checklist
+
+ Before presenting final results:
+
+ - [ ] Confirmed business question is answered
+ - [ ] Data was profiled and validated
+ - [ ] Metric definitions match the user's expectations
+ - [ ] Sanity checks pass (order of magnitude, trends, etc.)
+ - [ ] Visualizations follow best practices (search `--domain chart`)
+ - [ ] Assumptions and filters are documented
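Step 4's JOIN check ("run on 100 rows first, ask if the row count change is expected") is straightforward to script. A minimal sketch, assuming two pandas DataFrames joined on `user_id`; the helper `validate_join` and the sample frames are illustrative, not part of the package.

```python
import pandas as pd

def validate_join(left: pd.DataFrame, right: pd.DataFrame, key: str, sample: int = 100) -> pd.DataFrame:
    """Join a small sample first and flag unexpected row-count changes (Step 4)."""
    probe = left.head(sample).merge(right, on=key, how="left")
    if len(probe) != min(len(left), sample):
        # A left join that grows the row count means `key` is not unique on the right.
        print(f"Warning: join produced {len(probe)} rows from {min(len(left), sample)}. Expected?")
    return left.merge(right, on=key, how="left")

orders = pd.DataFrame({"user_id": [1, 2, 2], "amount": [10, 20, 30]})
users = pd.DataFrame({"user_id": [1, 2, 2], "plan": ["free", "pro", "pro"]})  # duplicate key on purpose
joined = validate_join(orders, users, key="user_id")  # prints a warning: 5 rows from 3
```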
@@ -0,0 +1,40 @@
+ ---
+ mode: agent
+ description: CrushData AI - Data analyst intelligence for structured analysis workflows
+ tools: ['codebase', 'terminal', 'file-operations']
+ ---
+
+ # CrushData AI - Data Analyst
+
+ Guide structured, professional data analysis with validation.
+
+ ## When to Activate
+ User requests: data analysis, dashboards, metrics, EDA, cohort, funnel, A/B tests
+
+ ## Workflow
+
+ ### 1. Discovery (MANDATORY)
+ Before coding, ask:
+ - What business question should this answer?
+ - Which tables contain the data?
+ - How does YOUR company define key metrics?
+
+ ### 2. Search Knowledge Base
+ ```bash
+ python3 .github/prompts/../.shared/data-analyst/scripts/search.py "<query>" --domain <domain>
+ ```
+
+ Domains: `workflow`, `metric`, `chart`, `sql`, `python`, `validation`
+ Industry: `--industry saas|ecommerce|finance|marketing`
+
+ ### 3. Profile Data
+ ```python
+ print(f"Shape: {df.shape}, Dates: {df['date'].min()} to {df['date'].max()}")
+ ```
+ Report and confirm before proceeding.
+
+ ### 4. Validate
+ - Verify JOINs
+ - Check totals
+ - Compare benchmarks
+ - User validation
+ 
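The "Check totals / Compare benchmarks" items in step 4 reduce to a simple assertion against a prior period. A minimal sketch; the tolerance threshold and the figures are assumptions for illustration.

```python
# Sanity-check an aggregated total against a prior period before delivery.
# The 0.5 tolerance (flag moves beyond +/-50%) is an assumed threshold, not from the package.
def sanity_check_total(total: float, prior_total: float, tolerance: float = 0.5) -> None:
    change = (total - prior_total) / prior_total
    if abs(change) > tolerance:
        print(f"Total = {total:,.0f} is {change:+.0%} vs prior period. Does this seem reasonable?")

sanity_check_total(total=180_000, prior_total=60_000)  # flags a +200% jump for review
```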
@@ -0,0 +1,50 @@
+ # CrushData AI - Data Analyst Command
+
+ A data analyst intelligence command that guides you through structured, professional data analysis.
+
+ ## When to Use
+
+ Activate this command when the user requests data analysis, dashboards, metrics, EDA, cohort/funnel analysis, or A/B testing.
+
+ ---
+
+ ## Workflow
+
+ ### 1. Discovery (MANDATORY)
+ Before coding, ask:
+ - What business question should this answer?
+ - Which tables contain the data?
+ - How does YOUR company define the key metrics?
+
+ ### 2. Search Knowledge Base
+ ```bash
+ python3 .cursor/commands/../.shared/data-analyst/scripts/search.py "<query>" --domain <domain>
+ ```
+
+ Domains: `workflow`, `metric`, `chart`, `sql`, `python`, `validation`
+
+ Industry: `--industry saas|ecommerce|finance|marketing`
+
+ ### 3. Data Profiling (MANDATORY)
+ ```python
+ print(f"Shape: {df.shape}, Date range: {df['date'].min()} to {df['date'].max()}")
+ ```
+ Report findings and ask the user for confirmation.
+
+ ### 4. Validate Before Delivery
+ - Check JOINs don't multiply rows unexpectedly
+ - Verify totals seem reasonable
+ - Compare to benchmarks
+ - Present for user validation
+
+ ---
+
+ ## Quick Reference
+
+ | Analysis | Search Command |
+ |----------|---------------|
+ | EDA | `search.py "EDA" --domain workflow` |
+ | Metrics | `search.py "churn" --domain metric` |
+ | SQL patterns | `search.py "cohort" --domain sql` |
+ | Charts | `search.py "time series" --domain chart` |
+ | Mistakes | `search.py "duplicate" --domain validation` |
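The "JOINs don't multiply rows" and "totals seem reasonable" checks in step 4 often reduce to one granularity assertion. A minimal sketch; the grain columns and sample data are illustrative assumptions.

```python
import pandas as pd

def assert_grain(df: pd.DataFrame, keys: list[str]) -> None:
    """Fail fast if df is not one row per combination of `keys` (e.g., user per day)."""
    dupes = df.duplicated(subset=keys).sum()
    assert dupes == 0, f"{dupes} duplicate rows at grain {keys}; sums would be inflated"

events = pd.DataFrame({
    "user_id": [1, 1, 2],
    "day": ["2024-01-01", "2024-01-01", "2024-01-01"],
    "amount": [10, 10, 5],
})
assert_grain(events, ["user_id", "day"])  # raises: user 1 appears twice on the same day
```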
@@ -0,0 +1,37 @@
+ # CrushData AI - Data Analyst Steering
+
+ ## Purpose
+ Guide structured, professional data analysis with validation protocols.
+
+ ## When to Activate
+ User requests: data analysis, dashboards, metrics, EDA, cohort, funnel, A/B tests
+
+ ## Required Behaviors
+
+ ### 1. Always Ask First
+ Before writing code, gather:
+ - Business question to answer
+ - Data tables/sources
+ - Company-specific metric definitions
+ - Time range and filters
+
+ ### 2. Search Before Implementing
+ ```bash
+ python3 .kiro/steering/../.shared/data-analyst/scripts/search.py "<query>" --domain <domain>
+ ```
+
+ Available domains: workflow, metric, chart, sql, python, database, validation
+
+ ### 3. Profile Data Before Analysis
+ Run and report:
+ - Row counts and date ranges
+ - Missing values
+ - Sample data
+
+ Ask: "Does this match your expectation?"
+
+ ### 4. Validate Before Delivery
+ - Sanity check totals
+ - Compare to benchmarks
+ - Document assumptions
+ - Present for user confirmation
@@ -0,0 +1,31 @@
+ Chart Type,Best For,Data Type,Comparison Type,Python Code,Color Guidance,Accessibility,Dashboard Tip
+ Line Chart,Trends over time and continuous data,Time-series,Trend,"plt.plot(df['date'], df['value']); plt.xlabel('Date'); plt.ylabel('Value')","Sequential blue/green for single metric; categorical for multiple series","Add markers for key data points; use sufficient line thickness","Place in middle section for trend visibility"
+ Bar Chart,Comparing categories or rankings,Categorical,Ranking Comparison,"plt.bar(df['category'], df['value']); plt.xticks(rotation=45)","Single color for one series; categorical for grouped","Ensure sufficient contrast between bars; label values directly","Use horizontal layout if labels are long"
+ Horizontal Bar Chart,Ranking with long labels,Categorical,Ranking,"plt.barh(df['category'], df['value'])","Single sequential color from light to dark by value","Order bars by value for easy scanning","Great for top-N lists and leaderboards"
+ Stacked Bar Chart,Part-to-whole over categories,Categorical,Composition,"df.plot(kind='bar', stacked=True)","Use distinct colors for each segment; max 5-6 segments","Include legend; consider labels on large segments","Good for showing composition across time periods"
+ Grouped Bar Chart,Comparing multiple series by category,Categorical,Comparison,"df.plot(kind='bar', position='dodge')","Categorical palette with clear distinction between groups","Limit to 3-4 groups per category; use legend","Best for A/B comparisons or time period comparisons"
+ Pie Chart,Simple part-to-whole (max 5 segments),Categorical,Composition,"plt.pie(df['value'], labels=df['category'], autopct='%1.1f%%')","High contrast between adjacent segments","Limit to 5 segments; order by size; include percentages","Avoid in dashboards - use donut or bar instead"
+ Donut Chart,Part-to-whole with center metric,Categorical,Composition,"plt.pie(df['value'], wedgeprops={'width': 0.4})","High contrast colors; use brand colors if relevant","Include total or key metric in center","Good for single KPI with breakdown"
+ Area Chart,Trends with volume emphasis,Time-series,Trend Volume,"plt.fill_between(df['date'], df['value'], alpha=0.3)","Light fill with darker line; sequential colors","Ensure baseline is visible; use transparency","Shows magnitude over time better than line"
+ Stacked Area Chart,Composition over time,Time-series,Composition Trend,"df.plot(kind='area', stacked=True)","Distinct colors for each layer; limit layers to 5","Consider 100% stacked for proportion focus","Good for market share or category mix over time"
+ Scatter Plot,Relationship between two variables,Numerical,Correlation,"plt.scatter(df['x'], df['y'])","Single color with alpha for density; color by category if needed","Add trendline for correlation; include R-squared","Use for identifying outliers and patterns"
+ Bubble Chart,Three-variable relationships,Numerical,Correlation Size,"plt.scatter(df['x'], df['y'], s=df['size']*100)","Color by category if applicable; size legend essential","Ensure bubbles don't overlap too much","Include size legend; limit to important points"
+ Heatmap,Correlations or matrix data,Numerical Matrix,Distribution,"sns.heatmap(df.corr(), annot=True, cmap='RdBu_r')","Diverging palette for correlation (-1 to 1); sequential for counts","Include value annotations; use colorblind-safe palette","Perfect for cohort retention tables"
+ Histogram,Distribution of single variable,Numerical,Distribution,"plt.hist(df['value'], bins=30)","Single color; consider outlier highlighting","Include mean/median line; label bin count","Use to understand data distribution before analysis"
+ Box Plot,Distribution comparison,Numerical,Distribution Comparison,"sns.boxplot(data=df, x='category', y='value')","One color per category; highlight outliers","Explain quartile meanings; include n count","Great for comparing distributions across groups"
+ Violin Plot,Distribution with density,Numerical,Distribution,"sns.violinplot(data=df, x='category', y='value')","Paired colors for split violins; sequential otherwise","More intuitive than box plots for some users","Good for showing bimodal distributions"
+ Funnel Chart,Sequential step conversion,Categorical,Drop-off,"import plotly.express as px; px.funnel(df, x='count', y='stage')","Blues from dark to light (top to bottom); or brand colors","Label conversion percentages between stages","Essential for showing conversion drop-off"
+ Waterfall Chart,Cumulative effect of values,Categorical,Contribution,"Use plotly or custom matplotlib with positive/negative coloring","Green for positive; red for negative; gray for subtotals","Start with total; show increases and decreases clearly","Great for bridge charts (start to end explanation)"
+ Gauge Chart,Single KPI with target,Single Value,Target,"Use plotly Indicator or custom graphic","Green/yellow/red zones based on targets","Include actual value and target","Use sparingly - one per major KPI"
+ KPI Card,Single important metric,Single Value,Status,"Text display with conditional formatting","Color based on performance (green/amber/red)","Large font; include trend arrow and context","Top of dashboard for most important metrics"
+ Sparkline,Compact trend indicator,Time-series,Trend,"Line chart rendered small without axes","Single color; consistent across dashboard","May be too small for some users","Great alongside KPI cards to show trend"
+ Table,Detailed data viewing,Multi-dimensional,Detail,"df.style.format() with conditional formatting","Alternate row colors; highlight important values","Ensure sufficient contrast; limit columns","Place at bottom of dashboard for drill-down"
+ Pivot Table,Cross-tabulation analysis,Multi-dimensional,Comparison,"pd.pivot_table(df, values='metric', index='row', columns='col')","Heatmap coloring on values if applicable","Include row/column totals","Good for exploration; use charts for communication"
+ Treemap,Hierarchical part-to-whole,Hierarchical,Composition,"import plotly.express as px; px.treemap(df, path=['parent', 'child'], values='value')","Distinct colors per category; size shows proportion","Include value labels on large segments","Good for budget/allocation visualization"
+ Sankey Diagram,Flow between categories,Flow,Flow,"Use plotly Sankey for flow visualization","Distinct colors per source node","Limit to 5-10 nodes for readability","Perfect for showing customer journeys"
+ Radar/Spider Chart,Multi-variable comparison,Multi-dimensional,Profile,"Use matplotlib radar chart or plotly","One color per entity being compared","Include reference lines; limit to 5-8 axes","Good for segment profiles or competitive analysis"
+ Geographic Map,Location-based data,Geographic,Distribution,"Use folium or plotly for choropleth maps","Sequential color scale for values","Ensure color scale is clear; include legend","Use for regional performance comparisons"
+ Calendar Heatmap,Activity over time,Time-series,Pattern,"Use calplot or custom heatmap by day","Sequential palette; highlight weekends differently","Label axes clearly; include color legend","Good for showing seasonal patterns"
+ Combination Chart,Mixed data types,Mixed,Correlation Trend,"Use secondary y-axis: ax2 = ax1.twinx()","Distinct colors for each series; clear legend","Ensure both scales are readable","Use when showing related but different units"
+ Small Multiples,Comparison across categories,Multi-dimensional,Comparison,"Use facet plot: sns.FacetGrid(df, col='category')","Consistent scale and colors across all charts","Keep individual charts simple","Great for comparing patterns across segments"
+ Bullet Chart,KPI vs target and comparison,Single Value,Target Comparison,"Use plotly for bullet chart implementation","Gray for comparison; bars for actual/target","Include labels for all elements","Compact alternative to gauge for multiple KPIs"
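To show how the `Python Code` column of this table is meant to be used: a minimal sketch expanding the Line Chart row into a runnable plot, applying the row's accessibility note about markers. The sample data is illustrative, not from the package.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Illustrative data; the CSV row only supplies the plotting pattern.
df = pd.DataFrame({
    "date": pd.date_range("2024-01-01", periods=6, freq="MS"),
    "value": [120, 135, 128, 150, 162, 171],
})

# Line Chart row: trends over time, markers for key data points, sufficient line thickness.
plt.plot(df["date"], df["value"], marker="o", linewidth=2)
plt.xlabel("Date")
plt.ylabel("Value")
plt.title("Monthly value trend")
plt.tight_layout()
plt.show()
```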
@@ -0,0 +1,21 @@
+ Issue Type,Detection Method,Solution,Python Code,SQL Code,Impact
+ Missing Values,df.isnull().sum() and df.isnull().mean(),"Drop rows, impute with mean/median/mode, forward fill, or flag with indicator","df['col'].fillna(df['col'].median(), inplace=True) or df.dropna(subset=['required_col'])","COALESCE(column, 0) or WHERE column IS NOT NULL","Missing data can skew aggregations, break joins, and cause errors"
+ Duplicate Rows,df.duplicated().sum() and df[df.duplicated()],"Remove exact duplicates or dedupe by key keeping latest","df.drop_duplicates(subset=['id'], keep='last', inplace=True)","WITH ranked AS (SELECT *, ROW_NUMBER() OVER (PARTITION BY id ORDER BY updated_at DESC) as rn FROM table) SELECT * FROM ranked WHERE rn = 1","Duplicates inflate counts, sums, and cause join multiplication"
+ Outliers,df.describe() and boxplot IQR method,"Remove, cap/Winsorize, or investigate - depends on domain","Q1, Q3 = df['col'].quantile([0.25, 0.75]); IQR = Q3-Q1; df = df[(df['col'] >= Q1-1.5*IQR) & (df['col'] <= Q3+1.5*IQR)]","WHERE value BETWEEN (SELECT PERCENTILE_CONT(0.25) - 1.5*IQR) AND (SELECT PERCENTILE_CONT(0.75) + 1.5*IQR)","Outliers can dominate averages and distort visualizations"
+ Data Type Mismatch,df.dtypes and df['col'].apply(type).value_counts(),"Convert to correct type with error handling","df['date'] = pd.to_datetime(df['date'], errors='coerce'); df['amount'] = pd.to_numeric(df['amount'], errors='coerce')","CAST(column AS DATE) or TRY_CAST for safe conversion","Wrong types cause sorting, filtering, and aggregation errors"
+ Inconsistent Date Formats,df['date'].str.contains(pattern).value_counts(),"Standardize to ISO format YYYY-MM-DD","df['date'] = pd.to_datetime(df['date'], format='mixed').dt.strftime('%Y-%m-%d')","TO_DATE(date_string, 'format pattern')","Inconsistent dates cause parsing errors and incorrect sorting"
+ Leading/Trailing Whitespace,df['col'].str.len() vs df['col'].str.strip().str.len(),"Strip whitespace from string columns","df['col'] = df['col'].str.strip()","TRIM(column)","Whitespace causes join failures and lookup misses"
+ Case Inconsistency,df['col'].str.lower().nunique() vs df['col'].nunique(),"Standardize to lowercase or title case","df['col'] = df['col'].str.lower() or .str.title()","LOWER(column) or UPPER(column)","Case differences cause grouping errors and aggregation issues"
+ Invalid Categories,df['category'].isin(valid_list).value_counts(),"Map invalid values or flag/remove","df['category'] = df['category'].replace({'invalid': 'Unknown'}); df = df[df['category'].isin(valid_list)]","CASE WHEN category IN ('valid1', 'valid2') THEN category ELSE 'Other' END","Invalid categories skew analysis and break filters"
+ Negative Values Where Impossible,df[df['col'] < 0].count(),"Flag, remove, or convert to absolute value","df = df[df['quantity'] >= 0] or df['quantity'] = df['quantity'].abs()","WHERE quantity >= 0 or ABS(quantity)","Negative quantities/prices indicate data entry errors"
+ Future Dates in Historical Data,df[df['date'] > pd.Timestamp.today()],"Remove or flag future-dated records","df = df[df['date'] <= pd.Timestamp.today()]","WHERE date <= CURRENT_DATE","Future dates indicate data pipeline or entry errors"
+ Zero Division Risk,df[df['denominator'] == 0].count(),"Handle zeros before division with NULLIF or fillna","df['ratio'] = df['numerator'] / df['denominator'].replace(0, np.nan)","numerator / NULLIF(denominator, 0)","Division by zero causes errors or inf values"
+ Encoding Issues,df['col'].str.contains('[^\x00-\x7F]'),"Fix encoding or remove special characters","df['col'] = df['col'].str.encode('ascii', 'ignore').str.decode('ascii')","REGEXP_REPLACE(col, '[^[:ascii:]]', '')","Encoding issues cause display and processing errors"
+ Null vs Zero Ambiguity,df['col'].isin([0, None]).value_counts(),"Decide semantic meaning: is 0 different from null?","Document decision: df['col'] = df['col'].fillna(0) # if semantically equivalent","Add explicit flag: CASE WHEN col IS NULL THEN 'Unknown' ELSE 'Known' END","Confusing null with zero leads to calculation errors"
+ Data Entry Typos,df['name'].str.lower().value_counts() looking for similar values,"Use fuzzy matching to identify and merge typos","from fuzzywuzzy import fuzz; identify similar strings","Use pg_trgm or Levenshtein distance functions","Typos split metrics that should be grouped together"
+ Orphan Records,df.merge(reference_df, how='left', indicator=True).query('_merge == \"left_only\"'),"Remove orphans or add to reference table","df = df[df['foreign_key'].isin(reference_df['id'])]","WHERE foreign_key IN (SELECT id FROM reference_table)","Orphan records indicate referential integrity issues"
+ Mixed Numeric Formats,df['col'].str.contains(r'[\$,€%]'),"Extract numeric values removing currency/percent symbols","df['amount'] = df['amount'].str.replace('[$,]', '', regex=True).astype(float)","CAST(REPLACE(REPLACE(col, '$', ''), ',', '') AS DECIMAL)","Mixed formats prevent numeric operations"
+ Boolean Inconsistency,df['flag'].unique() showing mixed True/False/1/0/Yes/No,"Standardize to consistent boolean representation","df['flag'] = df['flag'].map({'Yes': True, 'No': False, 1: True, 0: False})","CASE WHEN flag IN ('Yes', 'Y', '1', 'true') THEN TRUE ELSE FALSE END","Inconsistent booleans cause filtering errors"
+ Data Freshness,df['updated_at'].max() vs expected freshness,"Alert if data is stale beyond threshold","assert (pd.Timestamp.today() - df['updated_at'].max()).days < 1, 'Data is stale'","WHERE updated_at >= CURRENT_DATE - INTERVAL '1 day'","Stale data leads to incorrect analysis and decisions"
+ Cardinality Changes,Compare df['col'].nunique() to historical baseline,"Alert if cardinality changes unexpectedly","assert df['category'].nunique() == expected_count, f'Expected {expected_count} categories'","SELECT COUNT(DISTINCT col) and compare to metadata","New or missing categories indicate upstream issues"
+ Range Violations,df[~df['col'].between(min_val, max_val)],"Flag or remove out-of-range values","df = df[df['age'].between(0, 120)]","WHERE age BETWEEN 0 AND 120","Out-of-range values indicate data quality issues"
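A minimal sketch chaining three of this table's fixes (Leading/Trailing Whitespace, Data Type Mismatch, Duplicate Rows) into one pipeline; the column names and sample frame are illustrative.

```python
import pandas as pd

def basic_clean(df: pd.DataFrame) -> pd.DataFrame:
    """Apply three fixes from the table: trim whitespace, coerce dates, dedupe by key."""
    df = df.copy()
    df["name"] = df["name"].str.strip()                       # Leading/Trailing Whitespace
    df["date"] = pd.to_datetime(df["date"], errors="coerce")  # Data Type Mismatch
    df = df.drop_duplicates(subset=["id"], keep="last")       # Duplicate Rows, keeping latest
    return df

raw = pd.DataFrame({
    "id": [1, 1, 2],
    "name": [" Alice", "Alice ", "Bob"],
    "date": ["2024-01-01", "2024-01-02", "not a date"],
})
print(basic_clean(raw))  # two rows remain; the invalid date becomes NaT
```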
@@ -0,0 +1,31 @@
+ Chart Type,Best For,Data Type,Comparison Type,Python Code,Color Guidance,Accessibility,Dashboard Tip
+ Line Chart,Trends over time and continuous data,Time-series,Trend,"plt.plot(df['date'], df['value']); plt.xlabel('Date'); plt.ylabel('Value')","Sequential blue/green for single metric; categorical for multiple series","Add markers for key data points; use sufficient line thickness","Place in middle section for trend visibility"
+ Bar Chart,Comparing categories or rankings,Categorical,Ranking Comparison,"plt.bar(df['category'], df['value']); plt.xticks(rotation=45)","Single color for one series; categorical for grouped","Ensure sufficient contrast between bars; label values directly","Use horizontal layout if labels are long"
+ Horizontal Bar Chart,Ranking with long labels,Categorical,Ranking,"plt.barh(df['category'], df['value'])","Single sequential color from light to dark by value","Order bars by value for easy scanning","Great for top-N lists and leaderboards"
+ Stacked Bar Chart,Part-to-whole over categories,Categorical,Composition,"df.plot(kind='bar', stacked=True)","Use distinct colors for each segment; max 5-6 segments","Include legend; consider labels on large segments","Good for showing composition across time periods"
+ Grouped Bar Chart,Comparing multiple series by category,Categorical,Comparison,"df.plot(kind='bar', position='dodge')","Categorical palette with clear distinction between groups","Limit to 3-4 groups per category; use legend","Best for A/B comparisons or time period comparisons"
+ Pie Chart,Simple part-to-whole (max 5 segments),Categorical,Composition,"plt.pie(df['value'], labels=df['category'], autopct='%1.1f%%')","High contrast between adjacent segments","Limit to 5 segments; order by size; include percentages","Avoid in dashboards - use donut or bar instead"
+ Donut Chart,Part-to-whole with center metric,Categorical,Composition,"plt.pie(df['value'], wedgeprops={'width': 0.4})","High contrast colors; use brand colors if relevant","Include total or key metric in center","Good for single KPI with breakdown"
+ Area Chart,Trends with volume emphasis,Time-series,Trend Volume,"plt.fill_between(df['date'], df['value'], alpha=0.3)","Light fill with darker line; sequential colors","Ensure baseline is visible; use transparency","Shows magnitude over time better than line"
+ Stacked Area Chart,Composition over time,Time-series,Composition Trend,"df.plot(kind='area', stacked=True)","Distinct colors for each layer; limit layers to 5","Consider 100% stacked for proportion focus","Good for market share or category mix over time"
+ Scatter Plot,Relationship between two variables,Numerical,Correlation,"plt.scatter(df['x'], df['y'])","Single color with alpha for density; color by category if needed","Add trendline for correlation; include R-squared","Use for identifying outliers and patterns"
+ Bubble Chart,Three-variable relationships,Numerical,Correlation Size,"plt.scatter(df['x'], df['y'], s=df['size']*100)","Color by category if applicable; size legend essential","Ensure bubbles don't overlap too much","Include size legend; limit to important points"
+ Heatmap,Correlations or matrix data,Numerical Matrix,Distribution,"sns.heatmap(df.corr(), annot=True, cmap='RdBu_r')","Diverging palette for correlation (-1 to 1); sequential for counts","Include value annotations; use colorblind-safe palette","Perfect for cohort retention tables"
+ Histogram,Distribution of single variable,Numerical,Distribution,"plt.hist(df['value'], bins=30)","Single color; consider outlier highlighting","Include mean/median line; label bin count","Use to understand data distribution before analysis"
+ Box Plot,Distribution comparison,Numerical,Distribution Comparison,"sns.boxplot(data=df, x='category', y='value')","One color per category; highlight outliers","Explain quartile meanings; include n count","Great for comparing distributions across groups"
+ Violin Plot,Distribution with density,Numerical,Distribution,"sns.violinplot(data=df, x='category', y='value')","Paired colors for split violins; sequential otherwise","More intuitive than box plots for some users","Good for showing bimodal distributions"
+ Funnel Chart,Sequential step conversion,Categorical,Drop-off,"import plotly.express as px; px.funnel(df, x='count', y='stage')","Blues from dark to light (top to bottom); or brand colors","Label conversion percentages between stages","Essential for showing conversion drop-off"
+ Waterfall Chart,Cumulative effect of values,Categorical,Contribution,"Use plotly or custom matplotlib with positive/negative coloring","Green for positive; red for negative; gray for subtotals","Start with total; show increases and decreases clearly","Great for bridge charts (start to end explanation)"
+ Gauge Chart,Single KPI with target,Single Value,Target,"Use plotly Indicator or custom graphic","Green/yellow/red zones based on targets","Include actual value and target","Use sparingly - one per major KPI"
+ KPI Card,Single important metric,Single Value,Status,"Text display with conditional formatting","Color based on performance (green/amber/red)","Large font; include trend arrow and context","Top of dashboard for most important metrics"
+ Sparkline,Compact trend indicator,Time-series,Trend,"Line chart rendered small without axes","Single color; consistent across dashboard","May be too small for some users","Great alongside KPI cards to show trend"
+ Table,Detailed data viewing,Multi-dimensional,Detail,"df.style.format() with conditional formatting","Alternate row colors; highlight important values","Ensure sufficient contrast; limit columns","Place at bottom of dashboard for drill-down"
+ Pivot Table,Cross-tabulation analysis,Multi-dimensional,Comparison,"pd.pivot_table(df, values='metric', index='row', columns='col')","Heatmap coloring on values if applicable","Include row/column totals","Good for exploration; use charts for communication"
+ Treemap,Hierarchical part-to-whole,Hierarchical,Composition,"import plotly.express as px; px.treemap(df, path=['parent', 'child'], values='value')","Distinct colors per category; size shows proportion","Include value labels on large segments","Good for budget/allocation visualization"
+ Sankey Diagram,Flow between categories,Flow,Flow,"Use plotly Sankey for flow visualization","Distinct colors per source node","Limit to 5-10 nodes for readability","Perfect for showing customer journeys"
+ Radar/Spider Chart,Multi-variable comparison,Multi-dimensional,Profile,"Use matplotlib radar chart or plotly","One color per entity being compared","Include reference lines; limit to 5-8 axes","Good for segment profiles or competitive analysis"
+ Geographic Map,Location-based data,Geographic,Distribution,"Use folium or plotly for choropleth maps","Sequential color scale for values","Ensure color scale is clear; include legend","Use for regional performance comparisons"
+ Calendar Heatmap,Activity over time,Time-series,Pattern,"Use calplot or custom heatmap by day","Sequential palette; highlight weekends differently","Label axes clearly; include color legend","Good for showing seasonal patterns"
+ Combination Chart,Mixed data types,Mixed,Correlation Trend,"Use secondary y-axis: ax2 = ax1.twinx()","Distinct colors for each series; clear legend","Ensure both scales are readable","Use when showing related but different units"
+ Small Multiples,Comparison across categories,Multi-dimensional,Comparison,"Use facet plot: sns.FacetGrid(df, col='category')","Consistent scale and colors across all charts","Keep individual charts simple","Great for comparing patterns across segments"
+ Bullet Chart,KPI vs target and comparison,Single Value,Target Comparison,"Use plotly for bullet chart implementation","Gray for comparison; bars for actual/target","Include labels for all elements","Compact alternative to gauge for multiple KPIs"
@@ -0,0 +1,21 @@
+ Issue Type,Detection Method,Solution,Python Code,SQL Code,Impact
+ Missing Values,df.isnull().sum() and df.isnull().mean(),"Drop rows, impute with mean/median/mode, forward fill, or flag with indicator","df['col'].fillna(df['col'].median(), inplace=True) or df.dropna(subset=['required_col'])","COALESCE(column, 0) or WHERE column IS NOT NULL","Missing data can skew aggregations, break joins, and cause errors"
+ Duplicate Rows,df.duplicated().sum() and df[df.duplicated()],"Remove exact duplicates or dedupe by key keeping latest","df.drop_duplicates(subset=['id'], keep='last', inplace=True)","WITH ranked AS (SELECT *, ROW_NUMBER() OVER (PARTITION BY id ORDER BY updated_at DESC) as rn FROM table) SELECT * FROM ranked WHERE rn = 1","Duplicates inflate counts, sums, and cause join multiplication"
+ Outliers,df.describe() and boxplot IQR method,"Remove, cap/Winsorize, or investigate - depends on domain","Q1, Q3 = df['col'].quantile([0.25, 0.75]); IQR = Q3-Q1; df = df[(df['col'] >= Q1-1.5*IQR) & (df['col'] <= Q3+1.5*IQR)]","WHERE value BETWEEN (SELECT PERCENTILE_CONT(0.25) - 1.5*IQR) AND (SELECT PERCENTILE_CONT(0.75) + 1.5*IQR)","Outliers can dominate averages and distort visualizations"
+ Data Type Mismatch,df.dtypes and df['col'].apply(type).value_counts(),"Convert to correct type with error handling","df['date'] = pd.to_datetime(df['date'], errors='coerce'); df['amount'] = pd.to_numeric(df['amount'], errors='coerce')","CAST(column AS DATE) or TRY_CAST for safe conversion","Wrong types cause sorting, filtering, and aggregation errors"
+ Inconsistent Date Formats,df['date'].str.contains(pattern).value_counts(),"Standardize to ISO format YYYY-MM-DD","df['date'] = pd.to_datetime(df['date'], format='mixed').dt.strftime('%Y-%m-%d')","TO_DATE(date_string, 'format pattern')","Inconsistent dates cause parsing errors and incorrect sorting"
+ Leading/Trailing Whitespace,df['col'].str.len() vs df['col'].str.strip().str.len(),"Strip whitespace from string columns","df['col'] = df['col'].str.strip()","TRIM(column)","Whitespace causes join failures and lookup misses"
+ Case Inconsistency,df['col'].str.lower().nunique() vs df['col'].nunique(),"Standardize to lowercase or title case","df['col'] = df['col'].str.lower() or .str.title()","LOWER(column) or UPPER(column)","Case differences cause grouping errors and aggregation issues"
+ Invalid Categories,df['category'].isin(valid_list).value_counts(),"Map invalid values or flag/remove","df['category'] = df['category'].replace({'invalid': 'Unknown'}); df = df[df['category'].isin(valid_list)]","CASE WHEN category IN ('valid1', 'valid2') THEN category ELSE 'Other' END","Invalid categories skew analysis and break filters"
+ Negative Values Where Impossible,df[df['col'] < 0].count(),"Flag, remove, or convert to absolute value","df = df[df['quantity'] >= 0] or df['quantity'] = df['quantity'].abs()","WHERE quantity >= 0 or ABS(quantity)","Negative quantities/prices indicate data entry errors"
+ Future Dates in Historical Data,df[df['date'] > pd.Timestamp.today()],"Remove or flag future-dated records","df = df[df['date'] <= pd.Timestamp.today()]","WHERE date <= CURRENT_DATE","Future dates indicate data pipeline or entry errors"
+ Zero Division Risk,df[df['denominator'] == 0].count(),"Handle zeros before division with NULLIF or fillna","df['ratio'] = df['numerator'] / df['denominator'].replace(0, np.nan)","numerator / NULLIF(denominator, 0)","Division by zero causes errors or inf values"
+ Encoding Issues,df['col'].str.contains('[^\x00-\x7F]'),"Fix encoding or remove special characters","df['col'] = df['col'].str.encode('ascii', 'ignore').str.decode('ascii')","REGEXP_REPLACE(col, '[^[:ascii:]]', '')","Encoding issues cause display and processing errors"
+ Null vs Zero Ambiguity,df['col'].isin([0, None]).value_counts(),"Decide semantic meaning: is 0 different from null?","Document decision: df['col'] = df['col'].fillna(0) # if semantically equivalent","Add explicit flag: CASE WHEN col IS NULL THEN 'Unknown' ELSE 'Known' END","Confusing null with zero leads to calculation errors"
+ Data Entry Typos,df['name'].str.lower().value_counts() looking for similar values,"Use fuzzy matching to identify and merge typos","from fuzzywuzzy import fuzz; identify similar strings","Use pg_trgm or Levenshtein distance functions","Typos split metrics that should be grouped together"
+ Orphan Records,df.merge(reference_df, how='left', indicator=True).query('_merge == \"left_only\"'),"Remove orphans or add to reference table","df = df[df['foreign_key'].isin(reference_df['id'])]","WHERE foreign_key IN (SELECT id FROM reference_table)","Orphan records indicate referential integrity issues"
+ Mixed Numeric Formats,df['col'].str.contains(r'[\$,€%]'),"Extract numeric values removing currency/percent symbols","df['amount'] = df['amount'].str.replace('[$,]', '', regex=True).astype(float)","CAST(REPLACE(REPLACE(col, '$', ''), ',', '') AS DECIMAL)","Mixed formats prevent numeric operations"
+ Boolean Inconsistency,df['flag'].unique() showing mixed True/False/1/0/Yes/No,"Standardize to consistent boolean representation","df['flag'] = df['flag'].map({'Yes': True, 'No': False, 1: True, 0: False})","CASE WHEN flag IN ('Yes', 'Y', '1', 'true') THEN TRUE ELSE FALSE END","Inconsistent booleans cause filtering errors"
+ Data Freshness,df['updated_at'].max() vs expected freshness,"Alert if data is stale beyond threshold","assert (pd.Timestamp.today() - df['updated_at'].max()).days < 1, 'Data is stale'","WHERE updated_at >= CURRENT_DATE - INTERVAL '1 day'","Stale data leads to incorrect analysis and decisions"
+ Cardinality Changes,Compare df['col'].nunique() to historical baseline,"Alert if cardinality changes unexpectedly","assert df['category'].nunique() == expected_count, f'Expected {expected_count} categories'","SELECT COUNT(DISTINCT col) and compare to metadata","New or missing categories indicate upstream issues"
+ Range Violations,df[~df['col'].between(min_val, max_val)],"Flag or remove out-of-range values","df = df[df['age'].between(0, 120)]","WHERE age BETWEEN 0 AND 120","Out-of-range values indicate data quality issues"
@@ -0,0 +1,35 @@
+ Database,Category,Guideline,Do,Don't,Code Example
+ PostgreSQL,Connection,Use connection pooling for efficiency,"Use psycopg2 pool or SQLAlchemy with pool_size","Create new connection for each query","from sqlalchemy import create_engine; engine = create_engine('postgresql://...', pool_size=5)"
+ PostgreSQL,Query,Use EXPLAIN ANALYZE for query tuning,"EXPLAIN ANALYZE your slow queries; check for seq scans","Optimize blindly without understanding execution plan","EXPLAIN ANALYZE SELECT * FROM orders WHERE date > '2024-01-01'"
+ PostgreSQL,Indexing,Create indexes on filtered and joined columns,"CREATE INDEX idx_date ON orders(order_date)","Index every column; forget to ANALYZE after","CREATE INDEX CONCURRENTLY to avoid locking"
+ PostgreSQL,Dates,Use date_trunc for time grouping,"date_trunc('month', order_date)","String manipulation on dates","SELECT date_trunc('day', ts) as day, COUNT(*) FROM events GROUP BY 1"
+ PostgreSQL,Window,Use window functions for analytics,"OVER (PARTITION BY ... ORDER BY ...)","Self-joins for running totals","SUM(amount) OVER (ORDER BY date ROWS UNBOUNDED PRECEDING)"
+ PostgreSQL,CTEs,Use CTEs for readable complex queries,"WITH step1 AS (...), step2 AS (...)","Deeply nested subqueries","WITH monthly AS (SELECT date_trunc('month', date) ...) SELECT * FROM monthly"
+ BigQuery,Cost,Limit scanned data with partitions and clustering,"Use WHERE on partition column; SELECT only needed columns","SELECT * without partition filter","WHERE _PARTITIONDATE BETWEEN '2024-01-01' AND '2024-01-31'"
+ BigQuery,Partitioning,Partition tables by date for cost and performance,"PARTITION BY DATE(timestamp_column)","Query without partition filter","CREATE TABLE ... PARTITION BY DATE(created_at)"
+ BigQuery,Slots,Understand slot allocation for query performance,"Use INFORMATION_SCHEMA for slot usage; optimize large scans","Ignore slot exhaustion warnings","SELECT * FROM project.INFORMATION_SCHEMA.JOBS_BY_PROJECT"
+ BigQuery,UDFs,Use standard SQL before custom JavaScript UDFs,"Built-in functions are optimized","JavaScript UDFs for simple operations","Use SAFE_DIVIDE, IF, CASE instead of JS"
+ BigQuery,Approximate,Use approximate functions for large datasets,"APPROX_COUNT_DISTINCT, APPROX_QUANTILES","Exact distinct counts on huge tables","SELECT APPROX_COUNT_DISTINCT(user_id) FROM events"
+ BigQuery,Qualify,Use QUALIFY for window function filtering,"QUALIFY ROW_NUMBER() OVER (...) = 1","Subquery wrapper for filtering","SELECT * FROM table QUALIFY RANK() OVER (PARTITION BY id ORDER BY ts DESC) = 1"
+ Snowflake,Warehouse,Size warehouse appropriately for workload,"Use X-Small for dev; auto-suspend after 60s","Leave large warehouse running idle","ALTER WAREHOUSE SET WAREHOUSE_SIZE = 'MEDIUM' AUTO_SUSPEND = 60"
+ Snowflake,Clustering,Use clustering keys for large tables,"Cluster on frequently filtered columns","Cluster on high-cardinality columns","ALTER TABLE orders CLUSTER BY (region, order_date)"
+ Snowflake,Time Travel,Use time travel for data recovery and analysis,"Query historical data: AT (TIMESTAMP => ...)","Forget Time Travel is available for debugging","SELECT * FROM table AT (OFFSET => -3600)"
+ Snowflake,Zero Copy Clone,Clone tables for testing without storage cost,"CREATE TABLE test_copy CLONE production","Full physical copies for testing","CREATE TABLE dev.orders CLONE prod.orders"
+ MySQL,Indexes,Use composite indexes for multi-column queries,"Create index matching WHERE + ORDER BY columns","Too many indexes slow writes","CREATE INDEX idx_user_date ON orders(user_id, order_date)"
+ MySQL,Query Cache,Understand query cache behavior (deprecated in 8.0),"Use application-level caching instead","Rely on MySQL query cache","Use Redis or Memcached for caching"
+ MySQL,Limits,Use LIMIT with ORDER BY for pagination,"ORDER BY id LIMIT 100 OFFSET 200","LIMIT without ORDER BY (non-deterministic)","SELECT * FROM users ORDER BY id LIMIT 100 OFFSET 0"
+ MySQL,Joins,Prefer JOINs over subqueries when possible,"Rewrite correlated subqueries as JOINs","Correlated subqueries for large datasets","JOIN instead of WHERE x IN (SELECT ...)"
+ SQLite,Limitations,Understand SQLite limitations for analytics,"Good for local dev and small datasets","Use for production with concurrent writes","Maximum practical size ~1TB; limited concurrency"
+ SQLite,Types,SQLite uses dynamic typing,"Check actual types: typeof(column)","Assume strict typing like other databases","Be aware: '1' and 1 may both be stored"
+ Redshift,Distribution,Choose distribution key for join performance,"Distribute on frequently joined column","ALL distribution for large tables","CREATE TABLE ... DISTKEY(user_id)"
+ Redshift,Sort Keys,Use sort keys for range queries,"Sort on commonly filtered date columns","Too many sort key columns","CREATE TABLE ... SORTKEY(created_date)"
+ Redshift,Vacuum,Run VACUUM and ANALYZE regularly,"Schedule VACUUM DELETE ONLY weekly","Forget to reclaim space after deletes","VACUUM FULL table_name; ANALYZE table_name"
+ Redshift,Spectrum,Use Spectrum for querying S3 data directly,"External tables for cold/historical data","Load all data into Redshift","CREATE EXTERNAL TABLE pointing to S3"
+ MongoDB,Aggregation,Use aggregation pipeline for analytics,"$match early to reduce documents","Process large result sets in application","db.collection.aggregate([{$match: ...}, {$group: ...}])"
+ MongoDB,Indexes,Create indexes for query patterns,"Compound indexes matching query predicates","Index fields not used in queries","db.collection.createIndex({user_id: 1, date: -1})"
+ DynamoDB,Queries,Design for query patterns not data model,"Access patterns determine table design","Scan operations on large tables","Use Query with partition key; avoid Scan"
+ DynamoDB,GSI,Use Global Secondary Indexes for alternate access,"GSI for different query patterns","Too many GSIs (cost and write amplification)","Create GSI for each major access pattern"
+ General,Testing,Test queries on sample data first,"Use LIMIT or sampling for initial development","Run untested queries on full production data","SELECT * FROM table TABLESAMPLE (1 PERCENT)"
+ General,Transactions,Use transactions for data integrity,"Wrap related changes in transactions","Auto-commit for multi-statement updates","BEGIN; UPDATE ...; UPDATE ...; COMMIT;"
+ General,Comments,Document complex queries with comments,"Add comments explaining business logic","Uncommented complex SQL","-- Calculate 30-day rolling revenue per customer"
+ General,Parameterization,Use parameterized queries to prevent SQL injection,"Use ? or :param placeholders","String concatenation for query building","cursor.execute('SELECT * FROM users WHERE id = ?', (user_id,))"
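Expanding the final row (General / Parameterization) into a runnable example using Python's built-in sqlite3 module; the table and column names are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO users (id, name) VALUES (1, 'Alice')")

# Do: use a ? placeholder so the driver escapes the value.
user_id = 1
row = conn.execute("SELECT * FROM users WHERE id = ?", (user_id,)).fetchone()
print(row)  # (1, 'Alice')

# Don't: build the query by string concatenation -- it invites SQL injection.
# conn.execute(f"SELECT * FROM users WHERE id = {user_id}")
```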
@@ -0,0 +1,25 @@
+ Metric Name,Abbreviation,Category,Formula,Interpretation,Good Benchmark,Related Metrics,Visualization
+ Conversion Rate,CVR,Conversion,"Orders / Sessions * 100","Percentage of visits resulting in purchase","2-3% average; 5%+ excellent","AOV, Traffic, Add to Cart Rate",Line chart trend
+ Add to Cart Rate,ATC Rate,Conversion,"Add to Carts / Sessions * 100","Percentage adding items to cart","5-10% typical","CVR, Cart Abandonment",Funnel
+ Cart Abandonment Rate,Cart Abandon,Conversion,"(Carts Created - Purchases) / Carts Created * 100","Percentage of carts not converted","~70% average; < 60% is good","Checkout Conversion, Payment Failure",Funnel
+ Checkout Completion Rate,Checkout,Conversion,"Purchases / Checkout Started * 100","Checkout funnel success","70-80% is good","Cart Abandonment, Payment Methods",Funnel
+ Average Order Value,AOV,Revenue,"Total Revenue / Number of Orders","Average amount per order","Varies by category; track trend","Revenue, Items per Order",KPI card trend
+ Revenue Per Visitor,RPV,Revenue,"Total Revenue / Total Visitors","Revenue generated per visit","AOV * CVR","CVR, AOV, Traffic",KPI card
+ Gross Merchandise Value,GMV,Revenue,"Total value of merchandise sold","Platform transaction volume","Growing MoM/YoY","Revenue, Take Rate",Line chart
+ Net Revenue,Net Revenue,Revenue,"GMV - Returns - Discounts - Cancellations","Actual revenue after adjustments","Net/Gross ratio trending up","GMV, Return Rate",Line chart
+ Customer Acquisition Cost,CAC,Acquisition,"Marketing Spend / New Customers","Cost to acquire new customer","CAC < first order margin","LTV, ROAS",KPI card
+ Customer Lifetime Value,CLV,Acquisition,"AOV * Purchase Frequency * Customer Lifespan","Total expected customer revenue","CLV:CAC > 3:1","CAC, Repeat Rate, AOV",KPI card
+ Cost Per Order,CPO,Acquisition,"Marketing Spend / Orders","Marketing cost per order","Track by channel","CAC, ROAS",Bar by channel
+ Return on Ad Spend,ROAS,Acquisition,"Revenue from Ads / Ad Spend","Revenue per advertising dollar","3:1+ typically profitable","CAC, CPO",Line chart
+ Repeat Purchase Rate,Repeat Rate,Retention,"Customers with 2+ Orders / Total Customers * 100","Customer returning to buy again","> 30% for healthy retention","CLV, Purchase Frequency",Line chart
+ Purchase Frequency,Frequency,Retention,"Total Orders / Unique Customers (in period)","Average orders per customer","Varies by category; track trend","Repeat Rate, CLV",KPI card
+ Time Between Purchases,TBP,Retention,"Average days between customer orders","Purchase cycle length","Use for remarketing timing","Frequency, Repeat Rate",Histogram
+ Reactivation Rate,Reactivation,Retention,"Dormant Customers Who Returned / Total Dormant * 100","Success of win-back campaigns","5-15% typical for campaigns","Churn, Win-back Campaigns",Bar chart
+ Return Rate,Return Rate,Fulfillment,"Returned Orders / Total Orders * 100","Percentage of orders returned","< 20% most; < 30% apparel","Net Revenue, COGS",Line chart
+ On-Time Delivery Rate,OTD,Fulfillment,"Orders Delivered On Time / Total Shipped * 100","Shipping reliability","> 95% is excellent","Customer Satisfaction",Gauge
+ Stock-out Rate,Stockout,Inventory,"Items Out of Stock / Total SKUs * 100","Inventory availability","< 5% for popular items","Inventory Turnover, Lost Sales",Line chart
+ Inventory Turnover,Inv Turn,Inventory,"COGS / Average Inventory","How fast inventory sells","Higher is better; varies by cat","DOI, Stockout Rate",Bar chart
+ Days of Inventory,DOI,Inventory,"Average Inventory / (COGS / 365)","Days to sell current inventory","30-60 days typical","Inventory Turnover",KPI card
+ Website Traffic,Traffic,Traffic,"Total Sessions","Visit volume to site","Growing with quality","Bounce Rate, CVR",Line chart
+ Bounce Rate,Bounce,Traffic,"Single-page Sessions / Total Sessions * 100","Visitors leaving immediately","< 40% for product pages","Time on Site, Pages/Session",Line chart
+ Pages per Session,PPS,Traffic,"Total Pageviews / Sessions","Engagement depth","2-4 typical; higher for discovery","Bounce Rate, Time on Site",KPI card
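The first formulas in this table translate directly to pandas. A minimal sketch computing CVR and AOV from made-up daily figures; the column names are illustrative.

```python
import pandas as pd

daily = pd.DataFrame({
    "sessions": [1000, 1200, 900],
    "orders": [25, 30, 18],
    "revenue": [1250.0, 1620.0, 990.0],
})

# Conversion Rate: Orders / Sessions * 100
daily["cvr_pct"] = daily["orders"] / daily["sessions"] * 100
# Average Order Value: Total Revenue / Number of Orders
aov = daily["revenue"].sum() / daily["orders"].sum()
print(daily[["cvr_pct"]].round(2))
print(f"AOV: {aov:.2f}")  # 3860 / 73 = 52.88
```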
@@ -0,0 +1,24 @@
+ Metric Name,Abbreviation,Category,Formula,Interpretation,Good Benchmark,Related Metrics,Visualization
+ Gross Profit Margin,GPM,Profitability,"(Revenue - COGS) / Revenue * 100","Profit after direct costs","Varies by industry; 40-60% typical","Net Margin, COGS",KPI card
+ Net Profit Margin,NPM,Profitability,"Net Income / Revenue * 100","Bottom line profitability","Positive and stable","GPM, Operating Margin",KPI card
+ Operating Margin,Op Margin,Profitability,"Operating Income / Revenue * 100","Core business profitability","Positive for viable business","EBITDA Margin, SG&A",KPI card
+ EBITDA Margin,EBITDA,Profitability,"EBITDA / Revenue * 100","Cash-based profitability","Used for comparison across capital structures","Operating Margin, D&A",KPI card
+ Return on Assets,ROA,Returns,"Net Income / Total Assets * 100","Asset efficiency","> 5% generally good","ROE, Asset Turnover",KPI card
+ Return on Equity,ROE,Returns,"Net Income / Shareholders Equity * 100","Return to shareholders","> 15% considered good","ROA, Leverage",KPI card
+ Return on Investment,ROI,Returns,"(Gain - Cost) / Cost * 100","Investment profitability","> 0% means profitable","IRR, Payback Period",KPI card
+ Return on Capital Employed,ROCE,Returns,"EBIT / Capital Employed * 100","Efficiency of capital use","> cost of capital","ROIC, Capital Efficiency",KPI card
+ Current Ratio,Current,Liquidity,"Current Assets / Current Liabilities","Short-term liquidity","1.5-2.0 typically healthy","Quick Ratio, Working Capital",Gauge
+ Quick Ratio,Quick Ratio,Liquidity,"(Current Assets - Inventory) / Current Liabilities","Immediate liquidity","> 1.0 generally healthy","Current Ratio, Cash Ratio",Gauge
+ Cash Ratio,Cash Ratio,Liquidity,"Cash / Current Liabilities","Most conservative liquidity","> 0.5 is comfortable","Quick Ratio, Operating Cash",KPI card
+ Working Capital,Working Cap,Liquidity,"Current Assets - Current Liabilities","Operating liquidity","Positive and stable","Current Ratio, Cash Conversion",KPI card
+ Debt to Equity Ratio,D/E,Leverage,"Total Debt / Shareholders Equity","Financial leverage","< 2.0 generally healthy","Leverage Ratio, Interest Coverage",KPI card
+ Debt to Assets Ratio,D/A,Leverage,"Total Debt / Total Assets","Asset leverage","< 0.5 is conservative","D/E, Asset Coverage",KPI card
+ Interest Coverage Ratio,ICR,Leverage,"EBIT / Interest Expense","Ability to pay interest","> 3.0 is comfortable","D/E, Debt Service",KPI card
+ Asset Turnover,Asset Turn,Efficiency,"Revenue / Average Total Assets","Asset productivity","Higher is more efficient","ROA, Inventory Turnover",KPI card
+ Receivables Turnover,AR Turn,Efficiency,"Revenue / Average Accounts Receivable","Collection efficiency","Higher = faster collection","DSO, Cash Conversion",KPI card
+ Days Sales Outstanding,DSO,Efficiency,"(Accounts Receivable / Revenue) * 365","Average collection period","Lower is better; industry varies","AR Turnover, Cash Cycle",KPI card
+ Cash Conversion Cycle,CCC,Efficiency,"DIO + DSO - DPO","Days to convert inventory to cash","Shorter is better","DSO, DIO, DPO",KPI card
+ Revenue Growth Rate,Rev Growth,Growth,"(Current - Prior) / Prior * 100","Revenue increase rate","Depends on stage","YoY, MoM, CAGR",Line chart
+ CAGR,CAGR,Growth,"(End Value / Start Value)^(1/Years) - 1","Compound annual growth","Track vs peers and market","Revenue Growth, Projections",KPI card
+ Burn Rate,Burn,Cash,"Monthly Operating Expenses - Monthly Revenue","Net cash consumed","Lower = more runway","Runway, Cash Balance",Line chart
+ Runway,Runway,Cash,"Cash Balance / Monthly Burn Rate","Months of operations left","> 18 months for fundraising","Burn Rate, Cash Balance",KPI card
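The CAGR row's formula, (End Value / Start Value)^(1/Years) - 1, as a small function; the revenue figures below are made up for illustration.

```python
def cagr(start_value: float, end_value: float, years: float) -> float:
    """Compound annual growth rate: (End / Start)^(1/Years) - 1."""
    return (end_value / start_value) ** (1 / years) - 1

# Revenue growing from 1.0M to 1.728M over 3 years -> exactly 20% CAGR.
print(f"{cagr(1_000_000, 1_728_000, 3):.1%}")  # 20.0%
```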