crushdataai 1.0.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/assets/antigravity/data-analyst.md +95 -0
- package/assets/claude/SKILL.md +145 -0
- package/assets/copilot/data-analyst.prompt.md +40 -0
- package/assets/cursor/data-analyst.md +50 -0
- package/assets/kiro/data-analyst.md +37 -0
- package/assets/shared/charts.csv +31 -0
- package/assets/shared/cleaning.csv +21 -0
- package/assets/shared/data/charts.csv +31 -0
- package/assets/shared/data/cleaning.csv +21 -0
- package/assets/shared/data/databases.csv +35 -0
- package/assets/shared/data/industries/ecommerce.csv +25 -0
- package/assets/shared/data/industries/finance.csv +24 -0
- package/assets/shared/data/industries/marketing.csv +25 -0
- package/assets/shared/data/industries/saas.csv +24 -0
- package/assets/shared/data/metrics.csv +74 -0
- package/assets/shared/data/python-patterns.csv +31 -0
- package/assets/shared/data/report-ux.csv +26 -0
- package/assets/shared/data/sql-patterns.csv +36 -0
- package/assets/shared/data/validation.csv +21 -0
- package/assets/shared/data/workflows.csv +51 -0
- package/assets/shared/databases.csv +35 -0
- package/assets/shared/industries/ecommerce.csv +25 -0
- package/assets/shared/industries/finance.csv +24 -0
- package/assets/shared/industries/marketing.csv +25 -0
- package/assets/shared/industries/saas.csv +24 -0
- package/assets/shared/metrics.csv +74 -0
- package/assets/shared/python-patterns.csv +31 -0
- package/assets/shared/report-ux.csv +26 -0
- package/assets/shared/scripts/__pycache__/core.cpython-311.pyc +0 -0
- package/assets/shared/scripts/core.py +238 -0
- package/assets/shared/scripts/search.py +61 -0
- package/assets/shared/sql-patterns.csv +36 -0
- package/assets/shared/validation.csv +21 -0
- package/assets/shared/workflows.csv +51 -0
- package/assets/windsurf/data-analyst.md +35 -0
- package/dist/commands.d.ts +3 -0
- package/dist/commands.js +159 -0
- package/dist/index.d.ts +2 -0
- package/dist/index.js +31 -0
- package/package.json +45 -0
package/assets/shared/metrics.csv
@@ -0,0 +1,74 @@
+Metric Name,Abbreviation,Industry,Formula,Interpretation,Good Benchmark,Related Metrics,Visualization
+Monthly Recurring Revenue,MRR,SaaS,"SUM(active_subscriptions * monthly_price)","Predictable monthly revenue from subscriptions","Growing 10%+ MoM for early stage","ARR, ARPU, Churn",Line chart with trend
+Annual Recurring Revenue,ARR,SaaS,"MRR * 12","Annualized predictable revenue","ARR > $1M for Series A","MRR, Growth Rate",KPI card with YoY change
+Customer Churn Rate,Churn,SaaS,"Churned Customers / Total Customers at Period Start * 100","Percentage of customers lost in period","< 5% monthly for B2B SaaS","NRR, Retention Rate, Customer Lifetime",Line chart or gauge
+Customer Acquisition Cost,CAC,SaaS,"(Sales Spend + Marketing Spend) / New Customers Acquired","Total cost to acquire one new customer","LTV:CAC > 3:1","LTV, Payback Period, Marketing Spend",KPI card with trend
+Customer Lifetime Value,LTV,SaaS,"ARPU * Customer Lifetime (1/Churn Rate)","Total revenue expected from a customer over their lifetime","LTV:CAC > 3:1","CAC, Churn, ARPU",KPI card
+LTV to CAC Ratio,LTV:CAC,SaaS,"LTV / CAC","Health of unit economics - value vs acquisition cost","> 3:1 for sustainable growth","LTV, CAC",Gauge or ratio display
+Net Revenue Retention,NRR,SaaS,"(MRR Start + Expansion - Contraction - Churn) / MRR Start * 100","Revenue retained from existing customers including expansion","> 100% means growing without new customers","Gross Retention, Expansion Revenue",Line chart
+Average Revenue Per User,ARPU,SaaS,"Total Revenue / Total Customers","Average revenue generated per customer","Depends on pricing model","MRR, LTV, Pricing Tier Distribution",KPI card
+Payback Period,Payback,SaaS,"CAC / (ARPU * Gross Margin)","Months to recover customer acquisition cost","< 12 months for healthy SaaS","CAC, ARPU, Gross Margin",KPI card
+Gross Revenue Retention,GRR,SaaS,"(MRR Start - Contraction - Churn) / MRR Start * 100","Revenue retained excluding expansion","> 90% for enterprise SaaS","NRR, Churn",Line chart
+Trial to Paid Conversion,Trial Conv,SaaS,"Paid Signups / Trial Signups * 100","Percentage of trials that become paying customers","15-25% for freemium, 40-60% for free trial","Activation Rate, Time to Conversion",Funnel chart
+Activation Rate,Activation,SaaS,"Activated Users / Signups * 100","Users who reach aha moment","Varies by product - define activation first","Trial Conversion, Feature Adoption",Funnel chart
+Daily Active Users,DAU,SaaS,"COUNT(DISTINCT users active today)","Users engaging with product daily","Depends on product type","MAU, DAU/MAU Ratio, Stickiness",Line chart
+Monthly Active Users,MAU,SaaS,"COUNT(DISTINCT users active this month)","Users engaging with product monthly","Target based on TAM","DAU, Retention",Line chart
+Stickiness Ratio,DAU/MAU,SaaS,"DAU / MAU * 100","How often users return - product habit","> 20% is good, > 50% is exceptional","DAU, MAU, Retention",Gauge
+Expansion Revenue,Expansion MRR,SaaS,"MRR from upsells + cross-sells","Revenue growth from existing customers","> 30% of total growth is healthy","NRR, Upsell Rate, Cross-sell Rate",Stacked bar chart
+Quick Ratio,Quick Ratio,SaaS,"(New MRR + Expansion MRR) / (Churned MRR + Contraction MRR)","Growth efficiency - new vs lost revenue","> 4:1 indicates strong growth","MRR Growth, Churn, Expansion",KPI card
+Conversion Rate,Conv Rate,E-commerce,"Purchases / Sessions * 100","Percentage of visits that result in purchase","2-3% average, 5%+ is excellent","AOV, Traffic, Cart Abandonment",Line chart with trend
+Average Order Value,AOV,E-commerce,"Total Revenue / Number of Orders","Average amount spent per order","Varies by industry - track trend over time","Conversion Rate, Items per Order",KPI card
+Cart Abandonment Rate,Cart Abandon,E-commerce,"(Carts Created - Purchases) / Carts Created * 100","Percentage of shopping carts not completed","Industry average ~70%","Checkout Conversion, Payment Failure Rate",Funnel chart
+Customer Acquisition Cost,CAC,E-commerce,"Marketing Spend / New Customers","Cost to acquire one new customer","Should be < first order profit","LTV, First Order Margin",KPI card
+Customer Lifetime Value,CLV,E-commerce,"AOV * Purchase Frequency * Customer Lifespan","Total expected revenue from a customer","CLV:CAC > 3:1","CAC, Repeat Rate, AOV",KPI card
+Repeat Purchase Rate,Repeat Rate,E-commerce,"Customers with 2+ Orders / Total Customers * 100","Percentage of customers who buy again","> 30% is healthy for most categories","Retention, CLV, Purchase Frequency",Line chart
+Purchase Frequency,Purchase Freq,E-commerce,"Total Orders / Unique Customers","Average orders per customer per period","Varies by category - track trend","Repeat Rate, CLV",KPI card
+Revenue Per Visitor,RPV,E-commerce,"Total Revenue / Total Visitors","Revenue generated per site visit","Track trend - combines traffic and conversion","Conversion Rate, AOV, Traffic",KPI card
+Return Rate,Return Rate,E-commerce,"Returned Orders / Total Orders * 100","Percentage of orders returned","< 20% for most categories, < 30% for apparel","Net Revenue, Customer Complaints",Line chart
+Gross Margin,Gross Margin,E-commerce,"(Revenue - COGS) / Revenue * 100","Profit after product costs","40-60% typical for retail","Net Margin, COGS, Pricing",KPI card
+Inventory Turnover,Inv Turnover,E-commerce,"COGS / Average Inventory","How often inventory sells and is replaced","Higher is better - varies by category","Days of Inventory, Stock-outs",KPI card
+Customer Satisfaction Score,CSAT,E-commerce,"Satisfied Responses / Total Responses * 100","Customer satisfaction with specific interaction","> 80% is good, > 90% is excellent","NPS, Reviews, Return Rate",Gauge
+Net Promoter Score,NPS,E-commerce,"% Promoters - % Detractors","Likelihood to recommend (-100 to +100)","> 50 is excellent, > 70 is world class","CSAT, Reviews, Retention",Gauge
+Gross Profit Margin,GPM,Finance,"(Revenue - COGS) / Revenue * 100","Profit after direct costs","Varies by industry - compare to peers","Net Margin, Operating Margin",KPI card
+Net Profit Margin,NPM,Finance,"Net Income / Revenue * 100","Profit after all expenses","Positive and growing","Gross Margin, Operating Expenses",KPI card
+Operating Margin,Op Margin,Finance,"Operating Income / Revenue * 100","Profit from core operations","Positive indicates viable business model","Gross Margin, SG&A",KPI card
+Return on Investment,ROI,Finance,"(Gain - Cost) / Cost * 100","Return generated from investment","> 0% means profitable investment","IRR, Payback Period",KPI card
+Return on Assets,ROA,Finance,"Net Income / Total Assets * 100","How efficiently assets generate profit","> 5% is generally good","ROE, Asset Turnover",KPI card
+Return on Equity,ROE,Finance,"Net Income / Shareholders Equity * 100","Return generated on shareholder investment","> 15% is considered good","ROA, Leverage Ratio",KPI card
+Current Ratio,Current Ratio,Finance,"Current Assets / Current Liabilities","Ability to pay short-term obligations","> 1.5 indicates healthy liquidity","Quick Ratio, Working Capital",KPI card
+Debt to Equity Ratio,D/E Ratio,Finance,"Total Debt / Shareholders Equity","Financial leverage - debt vs equity","< 2 is generally healthy","Leverage, Interest Coverage",KPI card
+Working Capital,Working Cap,Finance,"Current Assets - Current Liabilities","Operating liquidity available","Positive and growing","Cash Flow, Current Ratio",KPI card
+Cash Flow from Operations,CFO,Finance,"Net cash from core business operations","Cash generated by business","Positive and growing","Net Income, Free Cash Flow",Line chart
+Free Cash Flow,FCF,Finance,"CFO - Capital Expenditures","Cash available after investments","Positive for mature companies","CFO, CapEx, Dividends",Line chart
+Revenue Growth Rate,Rev Growth,Finance,"(Current Revenue - Prior Revenue) / Prior Revenue * 100","Rate of revenue increase","Depends on stage - 20%+ for growth companies","MoM, YoY, CAGR",Line chart
+Burn Rate,Burn Rate,Finance,"Monthly Operating Expenses - Monthly Revenue","Net cash consumed per month","Runway > 18 months is safe","Runway, Cash Balance",Line chart
+Runway,Runway,Finance,"Cash Balance / Burn Rate","Months of operations remaining","> 18 months for fundraising","Burn Rate, Cash Balance",KPI card
+Click-Through Rate,CTR,Marketing,"Clicks / Impressions * 100","Effectiveness of ad or content","2-5% for search ads, 0.5-1% for display","CPC, Conversion Rate",Line chart
+Cost Per Click,CPC,Marketing,"Ad Spend / Clicks","Cost for each ad click","Varies by industry and platform","CTR, CPA, Quality Score",KPI card
+Cost Per Acquisition,CPA,Marketing,"Marketing Spend / Conversions","Cost to acquire a customer or lead","Should be < customer value","CAC, ROAS, Conversion Rate",KPI card
+Return on Ad Spend,ROAS,Marketing,"Revenue from Ads / Ad Spend","Revenue generated per ad dollar","3:1 or higher is typically profitable","ROI, CPA, Conversion Rate",KPI card
+Cost Per Mille,CPM,Marketing,"(Ad Spend / Impressions) * 1000","Cost per thousand impressions","Varies by channel and targeting","CTR, Reach, Frequency",KPI card
+Conversion Rate,Conv Rate,Marketing,"Conversions / Total Visitors * 100","Percentage of visitors taking desired action","Varies by goal - track improvement","CTR, CPA, Landing Page Views",Funnel chart
+Lead to Customer Rate,Lead Conv,Marketing,"Customers / Leads * 100","Percentage of leads that become customers","10-20% for B2B SaaS","SQL Rate, Sales Cycle",Funnel chart
+Marketing Qualified Leads,MQLs,Marketing,"COUNT leads meeting marketing criteria","Leads ready for sales follow-up","Growing with stable conversion rate","SQLs, Lead Velocity",Line chart
+Sales Qualified Leads,SQLs,Marketing,"COUNT leads meeting sales criteria","Leads accepted by sales team","Conversion from MQL > 30%","MQLs, Opportunities, Win Rate",Line chart
+Email Open Rate,Open Rate,Marketing,"Opens / Emails Delivered * 100","Percentage of emails opened","15-25% is typical","CTR, Unsubscribe Rate",Line chart
+Email Click Rate,Email CTR,Marketing,"Clicks / Emails Delivered * 100","Percentage of emails clicked","2-5% is typical","Open Rate, Conversion Rate",Line chart
+Unsubscribe Rate,Unsub Rate,Marketing,"Unsubscribes / Emails Delivered * 100","Percentage of recipients unsubscribing","< 0.5% is healthy","List Growth, Complaint Rate",Line chart
+Social Engagement Rate,Engagement,Marketing,"(Likes + Comments + Shares) / Followers * 100","Interaction with social content","1-5% depending on platform","Reach, Impressions, Follower Growth",Bar chart
+Bounce Rate,Bounce Rate,Marketing,"Single-page Sessions / Total Sessions * 100","Visitors leaving without interaction","< 40% for content sites, < 60% for landing pages","Time on Site, Pages per Session",Line chart
+Pages per Session,Pages/Session,Marketing,"Total Pageviews / Sessions","Content engagement depth","> 2 indicates good engagement","Bounce Rate, Session Duration",KPI card
+Average Session Duration,Avg Session,Marketing,"Total Session Duration / Sessions","Time spent on site per visit","> 2 minutes for content sites","Bounce Rate, Pages per Session",KPI card
+Brand Awareness,Awareness,Marketing,"Survey-based or search volume","Percentage aware of brand","Track growth over time","Share of Voice, Brand Recall",Line chart
+Share of Voice,SOV,Marketing,"Brand Mentions / Total Category Mentions * 100","Brand visibility vs competitors","Growing share indicates market gains","Brand Awareness, Market Share",Pie chart
+Attribution Rate,Attribution,Marketing,"Attributed Conversions / Total Conversions * 100","Conversions trackable to marketing","Higher is better for optimization","Multi-touch Attribution",Stacked bar
+Customer Retention Rate,Retention,General,"(Customers End - New Customers) / Customers Start * 100","Percentage of customers retained","Depends on industry - 90%+ for SaaS","Churn Rate, LTV",Line chart
+Week 1 Retention,W1 Retention,General,"Users Active Week 1 / Signups * 100","Users returning after first week","25-40% for consumer apps","D1, D7, D30 Retention",Cohort heatmap
+Month 1 Retention,M1 Retention,General,"Users Active Month 1 / Signups * 100","Users returning after first month","20-30% for consumer apps","Week 1, Month 3 Retention",Cohort heatmap
+User Growth Rate,User Growth,General,"(Users End - Users Start) / Users Start * 100","Rate of user base expansion","Depends on stage","DAU, MAU, Signups",Line chart
+Feature Adoption Rate,Feature Adopt,General,"Users Using Feature / Total Users * 100","Uptake of specific feature","> 50% for core features","Activation, Engagement",Bar chart
+Time to Value,TTV,General,"Time from signup to first value moment","Speed of initial user success","Shorter is better - define value moment","Activation Rate, Onboarding Completion",Histogram
+Support Tickets per User,Tickets/User,General,"Total Tickets / Active Users","Support burden per user","Decreasing over time is good","CSAT, Resolution Time",Line chart
+Average Resolution Time,Avg Resolution,General,"Total Resolution Time / Tickets Resolved","Time to resolve support tickets","Depends on complexity - track trend","First Response Time, Ticket Volume",Line chart
+Employee Net Promoter Score,eNPS,General,"% Promoters - % Detractors","Employee satisfaction and loyalty","> 20 is good, > 50 is excellent","Turnover Rate, Engagement",Gauge
+Revenue per Employee,Rev/Employee,General,"Total Revenue / Number of Employees","Efficiency and productivity","$200K+ for SaaS","Headcount, Revenue Growth",KPI card
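
The formulas in this table are spreadsheet-style shorthand rather than runnable code. As a quick illustration, here is a minimal pandas sketch (toy data; the frame and column names are hypothetical) that applies the Customer Churn Rate and ARPU definitions above:

import pandas as pd

# Hypothetical subscription snapshot for one month; columns are illustrative.
subs = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "mrr": [100.0, 250.0, 80.0, 120.0],
    "churned_this_month": [False, True, False, False],
})

customers_at_start = len(subs)
churn_rate = subs["churned_this_month"].sum() / customers_at_start * 100  # Churned / Start * 100
arpu = subs["mrr"].sum() / customers_at_start                             # Total Revenue / Total Customers
print(f"Churn: {churn_rate:.1f}%  ARPU: ${arpu:.2f}")                     # Churn: 25.0%  ARPU: $137.50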
package/assets/shared/python-patterns.csv
@@ -0,0 +1,31 @@
+Pattern Name,Use Case,pandas Code,polars Code,Performance
+Load CSV File,Read data from CSV file,"df = pd.read_csv('file.csv', parse_dates=['date_col'])","df = pl.read_csv('file.csv')","Use dtype parameter to reduce memory; usecols for subset"
+Load Excel File,Read data from Excel file,"df = pd.read_excel('file.xlsx', sheet_name='Sheet1')","df = pl.read_excel('file.xlsx')","Specify sheet_name; engine='openpyxl' for xlsx"
+Load Multiple CSVs,Combine CSVs from folder,"df = pd.concat([pd.read_csv(f) for f in glob.glob('*.csv')])","df = pl.concat([pl.read_csv(f) for f in glob.glob('*.csv')])","Use ignore_index=True to reset index"
+Database Connection,Connect to SQL database,"from sqlalchemy import create_engine; engine = create_engine('postgresql://...'); df = pd.read_sql(query, engine)","df = pl.read_database(query, connection_uri)","Use connection pooling for multiple queries"
+Filter Rows,Select rows matching condition,"df = df[df['status'] == 'active']; df = df[df['value'] > 100]","df = df.filter(pl.col('status') == 'active')","Chain filters or use & for multiple conditions"
+Select Columns,Choose specific columns,"df = df[['col1', 'col2', 'col3']]; df = df.drop(columns=['unwanted'])","df = df.select(['col1', 'col2', 'col3'])","Select early to reduce memory"
+Rename Columns,Change column names,"df = df.rename(columns={'old': 'new', 'old2': 'new2'})","df = df.rename({'old': 'new'})","Use dict for multiple renames"
+Sort Data,Order by column values,"df = df.sort_values(['col1', 'col2'], ascending=[True, False])","df = df.sort(['col1', 'col2'], descending=[False, True])","Sort after filtering for efficiency"
+Group By Aggregate,Aggregate data by groups,"df.groupby('category').agg({'value': 'sum', 'count': 'size'})","df.group_by('category').agg([pl.col('value').sum(), pl.len()])","Named aggregations: agg(total=('value', 'sum'))"
+Pivot Table,Create pivot table,"df.pivot_table(index='row', columns='col', values='value', aggfunc='sum', fill_value=0)","df.pivot(index='row', columns='col', values='value')","Use margins=True for totals"
+Melt Unpivot,Convert wide to long format,"df.melt(id_vars=['id'], value_vars=['col1', 'col2'], var_name='variable', value_name='value')","df.melt(id_vars=['id'])","Inverse of pivot operation"
+Join Merge,Combine two dataframes,"df = pd.merge(df1, df2, on='key', how='left')","df = df1.join(df2, on='key', how='left')","Validate: how='left'/'right'/'inner'/'outer'"
+Concatenate DataFrames,Stack dataframes vertically,"df = pd.concat([df1, df2], ignore_index=True)","df = pl.concat([df1, df2])","axis=0 for rows; axis=1 for columns"
+Apply Function,Transform values with function,"df['new'] = df['col'].apply(lambda x: x * 2)","df = df.with_columns(pl.col('col').map_elements(lambda x: x * 2))","Vectorized operations are faster than apply"
+Rolling Window,Calculate rolling statistics,"df['rolling_mean'] = df['value'].rolling(window=7).mean()","df.with_columns(pl.col('value').rolling_mean(7))","Specify min_periods for edge handling"
+Date Extraction,Extract date components,"df['year'] = df['date'].dt.year; df['month'] = df['date'].dt.month; df['weekday'] = df['date'].dt.dayofweek","df.with_columns(pl.col('date').dt.year().alias('year'))","dt accessor for date operations"
+Date Difference,Calculate days between dates,"df['days'] = (df['end_date'] - df['start_date']).dt.days","df.with_columns((pl.col('end_date') - pl.col('start_date')).dt.total_days())","Result is timedelta; use .days for integer"
+Lag/Lead Values,Get previous or next row values,"df['prev_value'] = df.groupby('id')['value'].shift(1)","df.with_columns(pl.col('value').shift(1).over('id'))","shift(-1) for next value (lead)"
+Cumulative Sum,Running total,"df['cumsum'] = df.groupby('category')['value'].cumsum()","df.with_columns(pl.col('value').cum_sum().over('category'))","Order matters - sort first if needed"
+Rank Values,Rank within groups,"df['rank'] = df.groupby('category')['value'].rank(ascending=False)","df.with_columns(pl.col('value').rank().over('category'))","method='min'/'dense'/'first' for tie handling"
+Percent of Total,Calculate percentage of group total,"df['pct'] = df['value'] / df.groupby('category')['value'].transform('sum')","df.with_columns((pl.col('value') / pl.col('value').sum().over('category')).alias('pct'))","transform applies group result back to rows"
+Fill Missing Forward,Forward fill nulls,"df['col'] = df['col'].fillna(method='ffill')","df.with_columns(pl.col('col').forward_fill())","bfill for backward fill"
+Replace Values,Map values to new values,"df['col'] = df['col'].replace({'old1': 'new1', 'old2': 'new2'})","df.with_columns(pl.col('col').replace({'old1': 'new1'}))","Use map for complex transformations"
+Binning Discretize,Convert continuous to categorical,"df['bin'] = pd.cut(df['value'], bins=[0, 10, 50, 100], labels=['low', 'med', 'high'])","df.with_columns(pl.col('value').cut([10, 50, 100]))","qcut for equal-frequency bins"
+One Hot Encoding,Convert categorical to dummies,"df = pd.get_dummies(df, columns=['category'], prefix='cat')","df.to_dummies(columns=['category'])","drop_first=True to avoid multicollinearity"
+Value Counts,Count occurrences of each value,"df['col'].value_counts(normalize=True)","df['col'].value_counts()","normalize=True for percentages"
+Describe Statistics,Summary statistics,"df.describe(include='all', percentiles=[.25, .5, .75])","df.describe()","include='all' for non-numeric columns"
+Correlation Matrix,Calculate correlations,"df.select_dtypes(include='number').corr()","df.select(pl.numeric_columns()).pearson_corr()","Use method='spearman' for non-linear"
+Cross Tabulation,Frequency table for two columns,"pd.crosstab(df['col1'], df['col2'], normalize='index')","N/A - use group_by and pivot","normalize='all'/'index'/'columns'"
+Sample Data,Random sample of rows,"df.sample(n=1000) or df.sample(frac=0.1)","df.sample(n=1000)","random_state for reproducibility"
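
Several of these patterns are typically chained together; a small self-contained pandas sketch (toy data, illustrative column names) combining the Group By Aggregate, Percent of Total, and Rank Values rows above:

import pandas as pd

df = pd.DataFrame({
    "category": ["a", "a", "b", "b", "b"],
    "value": [10, 30, 5, 20, 25],
})

# Percent of Total: transform broadcasts each group's sum back to its rows
df["pct"] = df["value"] / df.groupby("category")["value"].transform("sum")
# Rank Values: rank within each category, largest first
df["rank"] = df.groupby("category")["value"].rank(ascending=False)
# Group By Aggregate with a named aggregation, as the Performance column suggests
totals = df.groupby("category").agg(total=("value", "sum"))
print(df)
print(totals)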
package/assets/shared/report-ux.csv
@@ -0,0 +1,26 @@
+Category,Guideline,Do,Don't,Example
+Layout,5-Second Rule,"Put most important insight at top-left where eyes land first","Bury key metrics at bottom of page","CEO should see revenue trend in 5 seconds without scrolling"
+Layout,Inverted Pyramid,"Structure: KPIs at top → Trends in middle → Details at bottom","Start with detailed tables; hide summary at bottom","Top row: 3-5 KPI cards; Middle: Line charts; Bottom: Detailed table"
+Layout,Visual Hierarchy,"Use size and position to indicate importance","Equal sizing for all elements; no focal point","Largest chart for most important metric; smaller for supporting"
+Layout,White Space,"Give elements room to breathe; avoid cramped layouts","Fill every pixel with data; no margins","Minimum 16px padding between dashboard sections"
+Layout,Grid System,"Align elements to consistent grid for clean appearance","Random positioning of elements","Use 12-column grid; align chart edges"
+Layout,Responsive Design,"Design for multiple screen sizes; test on mobile","Fixed-width layouts that break on small screens","Cards stack vertically on mobile; charts resize"
+Color,Consistent Meaning,"Use same colors for same meanings throughout","Red means growth in one place and decline in another","Red = negative/alert; Green = positive/growth everywhere"
+Color,Limit Palette,"Use 3-5 colors maximum per visualization","Rainbow of colors with no meaning","Primary brand color + 2-3 supporting colors"
+Color,Colorblind Safe,"Test with colorblind simulation; avoid red-green only","Rely solely on red vs green for meaning","Use patterns, labels, or blue-orange instead"
+Color,Sequential Palettes,"Use for continuous data showing magnitude","Categorical colors for numerical ranges","Light to dark blue for low to high values"
+Color,Diverging Palettes,"Use for data with meaningful midpoint (pos/neg)","Sequential palette for data diverging from center","Blue-white-red for profit/loss or sentiment"
+Color,Background Contrast,"Ensure sufficient contrast for readability","Light text on light background; low contrast","WCAG AA contrast ratio minimum (4.5:1 for text)"
+Typography,Hierarchy,"Use font size to establish content hierarchy","Same size for titles, labels, and values","Title: 24px; Subtitle: 16px; Body: 14px; Labels: 12px"
+Typography,Readability,"Choose readable fonts; limit to 2 families","Decorative fonts for data; too many font families","Sans-serif for data (Inter, Roboto); consistent weights"
+Typography,Number Formatting,"Format numbers for readability: 1.2M not 1234567","Raw unformatted numbers","$1.2M; 45.3%; 1,234 users"
+Typography,Axis Labels,"Label axes clearly; include units","Unlabeled axes; cryptic abbreviations","Revenue (USD, Millions) not just 'Rev'"
+Interactivity,Drill Down,"Let users click to explore underlying data","Force users to ask for details separately","Click bar to see breakdown by category"
+Interactivity,Filters,"Provide relevant filters; show active filter state","Too many filters; hidden filter state","Date range, region, segment filters clearly visible"
+Interactivity,Tooltips,"Show details on hover without cluttering view","Tooltips blocking other content","Hover shows: Date, Value, % Change, Comparison"
+Interactivity,Linked Views,"Connect related charts; filter one affects others","Isolated charts with no relationship","Clicking segment in pie filters line chart"
+Content,Title Everything,"Every chart needs a clear descriptive title","Untitled charts relying on context","Revenue by Region (Q4 2024) not just 'Revenue'"
+Content,Annotate Insights,"Highlight anomalies and key points","Let users discover insights alone","Arrow pointing to spike with explanation text"
+Content,Show Context,"Include comparison: vs target, last period, benchmark","Single number with no reference point","Revenue: $1.2M (↑ 15% YoY, 5% above target)"
+Content,Data Freshness,"Clearly show when data was last updated","Stale data without indication","Last updated: 2024-01-15 08:00 UTC"
+Content,Source Attribution,"Cite data source for credibility","Unknown data origin","Source: Sales Database, Marketing API"
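
Most of these guidelines are design judgment, but the Number Formatting row is mechanical enough to automate; a minimal Python sketch of a compact-number formatter (the function name and thresholds are illustrative, not part of the package):

def fmt_compact(n: float) -> str:
    """Format 1234567 as '1.2M' per the Number Formatting guideline above."""
    for threshold, suffix in ((1e9, "B"), (1e6, "M"), (1e3, "K")):
        if abs(n) >= threshold:
            return f"{n / threshold:.1f}{suffix}"
    return f"{n:,.0f}"

print(fmt_compact(1234567))  # 1.2M
print(fmt_compact(950))      # 950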
package/assets/shared/scripts/__pycache__/core.cpython-311.pyc
Binary file (contents not shown)
package/assets/shared/scripts/core.py
@@ -0,0 +1,238 @@
+#!/usr/bin/env python3
+# -*- coding: utf-8 -*-
+"""
+CrushData AI Core - BM25 search engine for data analyst workflows
+"""
+
+import csv
+import re
+from pathlib import Path
+from math import log
+from collections import defaultdict
+
+# ============ CONFIGURATION ============
+DATA_DIR = Path(__file__).parent.parent / "data"
+MAX_RESULTS = 3
+
+CSV_CONFIG = {
+    "workflow": {
+        "file": "workflows.csv",
+        "search_cols": ["Workflow Name", "Step Name", "Description", "Questions to Ask"],
+        "output_cols": ["Workflow Name", "Step Number", "Step Name", "Description", "Questions to Ask", "Tools/Commands", "Outputs", "Common Mistakes"]
+    },
+    "metric": {
+        "file": "metrics.csv",
+        "search_cols": ["Metric Name", "Abbreviation", "Industry", "Interpretation"],
+        "output_cols": ["Metric Name", "Abbreviation", "Industry", "Formula", "Interpretation", "Good Benchmark", "Related Metrics", "Visualization"]
+    },
+    "chart": {
+        "file": "charts.csv",
+        "search_cols": ["Chart Type", "Best For", "Data Type", "Comparison Type"],
+        "output_cols": ["Chart Type", "Best For", "Data Type", "Comparison Type", "Python Code", "Color Guidance", "Accessibility", "Dashboard Tip"]
+    },
+    "cleaning": {
+        "file": "cleaning.csv",
+        "search_cols": ["Issue Type", "Detection Method", "Solution"],
+        "output_cols": ["Issue Type", "Detection Method", "Solution", "Python Code", "SQL Code", "Impact"]
+    },
+    "sql": {
+        "file": "sql-patterns.csv",
+        "search_cols": ["Pattern Name", "Use Case", "SQL Code"],
+        "output_cols": ["Pattern Name", "Use Case", "SQL Code", "PostgreSQL", "BigQuery", "Performance"]
+    },
+    "python": {
+        "file": "python-patterns.csv",
+        "search_cols": ["Pattern Name", "Use Case", "pandas Code"],
+        "output_cols": ["Pattern Name", "Use Case", "pandas Code", "polars Code", "Performance"]
+    },
+    "database": {
+        "file": "databases.csv",
+        "search_cols": ["Database", "Category", "Guideline", "Do", "Don't"],
+        "output_cols": ["Database", "Category", "Guideline", "Do", "Don't", "Code Example"]
+    },
+    "report": {
+        "file": "report-ux.csv",
+        "search_cols": ["Category", "Guideline", "Do", "Don't"],
+        "output_cols": ["Category", "Guideline", "Do", "Don't", "Example"]
+    },
+    "validation": {
+        "file": "validation.csv",
+        "search_cols": ["Mistake Type", "Description", "Symptoms"],
+        "output_cols": ["Mistake Type", "Description", "Symptoms", "Prevention Query", "User Question"]
+    }
+}
+
+INDUSTRY_CONFIG = {
+    "saas": {"file": "industries/saas.csv"},
+    "ecommerce": {"file": "industries/ecommerce.csv"},
+    "finance": {"file": "industries/finance.csv"},
+    "marketing": {"file": "industries/marketing.csv"}
+}
+
+# Common columns for all industry files
+_INDUSTRY_COLS = {
+    "search_cols": ["Metric Name", "Abbreviation", "Category", "Interpretation"],
+    "output_cols": ["Metric Name", "Abbreviation", "Category", "Formula", "Interpretation", "Good Benchmark", "Related Metrics", "Visualization"]
+}
+
+AVAILABLE_INDUSTRIES = list(INDUSTRY_CONFIG.keys())
+
+
+# ============ BM25 IMPLEMENTATION ============
+class BM25:
+    """BM25 ranking algorithm for text search"""
+
+    def __init__(self, k1=1.5, b=0.75):
+        self.k1 = k1
+        self.b = b
+        self.corpus = []
+        self.doc_lengths = []
+        self.avgdl = 0
+        self.idf = {}
+        self.doc_freqs = defaultdict(int)
+        self.N = 0
+
+    def tokenize(self, text):
+        """Lowercase, split, remove punctuation, filter short words"""
+        text = re.sub(r'[^\w\s]', ' ', str(text).lower())
+        return [w for w in text.split() if len(w) > 2]
+
+    def fit(self, documents):
+        """Build BM25 index from documents"""
+        self.corpus = [self.tokenize(doc) for doc in documents]
+        self.N = len(self.corpus)
+        if self.N == 0:
+            return
+        self.doc_lengths = [len(doc) for doc in self.corpus]
+        self.avgdl = sum(self.doc_lengths) / self.N
+
+        for doc in self.corpus:
+            seen = set()
+            for word in doc:
+                if word not in seen:
+                    self.doc_freqs[word] += 1
+                    seen.add(word)
+
+        for word, freq in self.doc_freqs.items():
+            self.idf[word] = log((self.N - freq + 0.5) / (freq + 0.5) + 1)
+
+    def score(self, query):
+        """Score all documents against query"""
+        query_tokens = self.tokenize(query)
+        scores = []
+
+        for idx, doc in enumerate(self.corpus):
+            score = 0
+            doc_len = self.doc_lengths[idx]
+            term_freqs = defaultdict(int)
+            for word in doc:
+                term_freqs[word] += 1
+
+            for token in query_tokens:
+                if token in self.idf:
+                    tf = term_freqs[token]
+                    idf = self.idf[token]
+                    numerator = tf * (self.k1 + 1)
+                    denominator = tf + self.k1 * (1 - self.b + self.b * doc_len / self.avgdl)
+                    score += idf * numerator / denominator
+
+            scores.append((idx, score))
+
+        return sorted(scores, key=lambda x: x[1], reverse=True)
+
+
+# ============ SEARCH FUNCTIONS ============
+def _load_csv(filepath):
+    """Load CSV and return list of dicts"""
+    with open(filepath, 'r', encoding='utf-8') as f:
+        return list(csv.DictReader(f))
+
+
+def _search_csv(filepath, search_cols, output_cols, query, max_results):
+    """Core search function using BM25"""
+    if not filepath.exists():
+        return []
+
+    data = _load_csv(filepath)
+
+    # Build documents from search columns
+    documents = [" ".join(str(row.get(col, "")) for col in search_cols) for row in data]
+
+    # BM25 search
+    bm25 = BM25()
+    bm25.fit(documents)
+    ranked = bm25.score(query)
+
+    # Get top results with score > 0
+    results = []
+    for idx, score in ranked[:max_results]:
+        if score > 0:
+            row = data[idx]
+            results.append({col: row.get(col, "") for col in output_cols if col in row})
+
+    return results
+
+
+def detect_domain(query):
+    """Auto-detect the most relevant domain from query"""
+    query_lower = query.lower()
+
+    domain_keywords = {
+        "workflow": ["workflow", "process", "step", "eda", "dashboard", "cohort", "funnel", "analysis", "pipeline"],
+        "metric": ["metric", "kpi", "mrr", "arr", "churn", "cac", "ltv", "conversion", "rate", "ratio"],
+        "chart": ["chart", "graph", "visualization", "plot", "bar", "line", "pie", "heatmap", "scatter"],
+        "cleaning": ["clean", "missing", "null", "duplicate", "outlier", "impute", "data quality"],
+        "sql": ["sql", "query", "join", "window", "aggregate", "cte", "subquery", "partition"],
+        "python": ["python", "pandas", "polars", "dataframe", "pivot", "groupby", "merge"],
+        "database": ["postgres", "bigquery", "snowflake", "mysql", "database", "connection", "warehouse"],
+        "report": ["dashboard", "report", "layout", "ux", "design", "color", "visual"],
+        "validation": ["mistake", "error", "sanity", "check", "validate", "verify", "wrong"]
+    }
+
+    scores = {domain: sum(1 for kw in keywords if kw in query_lower) for domain, keywords in domain_keywords.items()}
+    best = max(scores, key=scores.get)
+    return best if scores[best] > 0 else "workflow"
+
+
+def search(query, domain=None, max_results=MAX_RESULTS):
+    """Main search function with auto-domain detection"""
+    if domain is None:
+        domain = detect_domain(query)
+
+    config = CSV_CONFIG.get(domain, CSV_CONFIG["workflow"])
+    filepath = DATA_DIR / config["file"]
+
+    if not filepath.exists():
+        return {"error": f"File not found: {filepath}", "domain": domain}
+
+    results = _search_csv(filepath, config["search_cols"], config["output_cols"], query, max_results)
+
+    return {
+        "domain": domain,
+        "query": query,
+        "file": config["file"],
+        "count": len(results),
+        "results": results
+    }
+
+
+def search_industry(query, industry, max_results=MAX_RESULTS):
+    """Search industry-specific metrics"""
+    if industry not in INDUSTRY_CONFIG:
+        return {"error": f"Unknown industry: {industry}. Available: {', '.join(AVAILABLE_INDUSTRIES)}"}
+
+    filepath = DATA_DIR / INDUSTRY_CONFIG[industry]["file"]
+
+    if not filepath.exists():
+        return {"error": f"Industry file not found: {filepath}", "industry": industry}
+
+    results = _search_csv(filepath, _INDUSTRY_COLS["search_cols"], _INDUSTRY_COLS["output_cols"], query, max_results)
+
+    return {
+        "domain": "industry",
+        "industry": industry,
+        "query": query,
+        "file": INDUSTRY_CONFIG[industry]["file"],
+        "count": len(results),
+        "results": results
+    }
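
Assuming the CSV data files ship alongside the scripts (which is what DATA_DIR implies and the file list above confirms), the module can be exercised directly; a short usage sketch:

# Usage sketch for core.py; run from package/assets/shared/scripts/.
from core import BM25, detect_domain, search

print(detect_domain("how do I build a retention cohort"))  # -> "workflow" ("cohort" keyword)

hit = search("churn rate", domain="metric", max_results=1)
print(hit["file"], hit["count"])  # metrics.csv plus the number of matches

# The BM25 class also works standalone on any list of strings:
bm = BM25()
bm.fit(["monthly recurring revenue", "customer churn rate", "net promoter score"])
print(bm.score("churn")[0])  # (best-matching index, its BM25 score)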
package/assets/shared/scripts/search.py
@@ -0,0 +1,61 @@
+#!/usr/bin/env python3
+# -*- coding: utf-8 -*-
+"""
+CrushData AI Search - CLI entry point for data analyst search
+Usage: python search.py "<query>" [--domain <domain>] [--industry <industry>] [--max-results 3]
+
+Domains: workflow, metric, chart, cleaning, sql, python, database, report, validation
+Industries: saas, ecommerce, finance, marketing
+"""
+
+import argparse
+from core import CSV_CONFIG, AVAILABLE_INDUSTRIES, MAX_RESULTS, search, search_industry
+
+
+def format_output(result):
+    """Format results for AI consumption (token-optimized)"""
+    if "error" in result:
+        return f"Error: {result['error']}"
+
+    output = []
+    if result.get("industry"):
+        output.append(f"## CrushData AI Industry Metrics")
+        output.append(f"**Industry:** {result['industry']} | **Query:** {result['query']}")
+    else:
+        output.append(f"## CrushData AI Search Results")
+        output.append(f"**Domain:** {result['domain']} | **Query:** {result['query']}")
+    output.append(f"**Source:** {result['file']} | **Found:** {result['count']} results\n")
+
+    for i, row in enumerate(result['results'], 1):
+        output.append(f"### Result {i}")
+        for key, value in row.items():
+            value_str = str(value)
+            if len(value_str) > 300:
+                value_str = value_str[:300] + "..."
+            output.append(f"- **{key}:** {value_str}")
+        output.append("")
+
+    return "\n".join(output)
+
+
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser(description="CrushData AI Search")
+    parser.add_argument("query", help="Search query")
+    parser.add_argument("--domain", "-d", choices=list(CSV_CONFIG.keys()), help="Search domain")
+    parser.add_argument("--industry", "-i", choices=AVAILABLE_INDUSTRIES, help="Industry-specific search (saas, ecommerce, finance, marketing)")
+    parser.add_argument("--max-results", "-n", type=int, default=MAX_RESULTS, help="Max results (default: 3)")
+    parser.add_argument("--json", action="store_true", help="Output as JSON")
+
+    args = parser.parse_args()
+
+    # Industry search takes priority
+    if args.industry:
+        result = search_industry(args.query, args.industry, args.max_results)
+    else:
+        result = search(args.query, args.domain, args.max_results)
+
+    if args.json:
+        import json
+        print(json.dumps(result, indent=2, ensure_ascii=False))
+    else:
+        print(format_output(result))
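
Given the argparse setup above, typical invocations (example queries; run from the scripts directory) look like:

python search.py "cohort retention heatmap" --domain workflow
python search.py "churn" --industry saas --max-results 2
python search.py "window function dedupe" -d sql --json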
package/assets/shared/sql-patterns.csv
@@ -0,0 +1,36 @@
+Pattern Name,Use Case,SQL Code,PostgreSQL,BigQuery,Performance
+Running Total,"Cumulative sum over time","SUM(value) OVER (ORDER BY date ROWS UNBOUNDED PRECEDING)","Same","Same","Efficient with index on date column"
+Running Average,"Moving average over all prior rows","AVG(value) OVER (ORDER BY date ROWS UNBOUNDED PRECEDING)","Same","Same","Consider fixed window for performance"
+Rolling Window Average,"N-period moving average","AVG(value) OVER (ORDER BY date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW)","Same","Same","Fixed window more efficient than unbounded"
+Lag Previous Value,"Compare to previous row","LAG(value, 1) OVER (ORDER BY date)","Same","Same","Useful for period-over-period calculations"
+Lead Next Value,"Look ahead to next row","LEAD(value, 1) OVER (ORDER BY date)","Same","Same","Use for forward-looking comparisons"
+Year over Year,"Compare to same period last year","LAG(value, 12) OVER (ORDER BY month) for monthly; or self-join on date - INTERVAL '1 year'","Same; use date_trunc('year', date)","DATE_SUB(date, INTERVAL 1 YEAR)","Index on date; pre-aggregate to month level"
+Month over Month,"Compare to previous month","LAG(value, 1) OVER (ORDER BY month)","Same","Same","Pre-aggregate daily to monthly first"
+Percent Change,"Calculate growth rate","(value - LAG(value, 1) OVER (ORDER BY date)) / NULLIF(LAG(value, 1) OVER (ORDER BY date), 0) * 100","Same","Same","Handle divide by zero with NULLIF"
+Rank,"Rank rows by value","RANK() OVER (ORDER BY value DESC)","Same","Same","Gaps in ranking for ties"
+Dense Rank,"Rank without gaps","DENSE_RANK() OVER (ORDER BY value DESC)","Same","Same","No gaps - consecutive numbers"
+Row Number,"Unique row identifier","ROW_NUMBER() OVER (ORDER BY date)","Same","Same","Good for pagination"
+Percent Rank,"Percentile position","PERCENT_RANK() OVER (ORDER BY value)","Same","Same","Returns 0-1 scale"
+NTILE Buckets,"Divide into N equal groups","NTILE(4) OVER (ORDER BY value)","Same","Same","Useful for quartile analysis"
+First Value in Group,"Get first value per partition","FIRST_VALUE(value) OVER (PARTITION BY group ORDER BY date)","Same","Same","Useful for cohort first action"
+Last Value in Group,"Get last value per partition","LAST_VALUE(value) OVER (PARTITION BY group ORDER BY date ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)","Same","Same","Must specify frame for last value"
+Deduplication,"Get latest record per entity","WITH ranked AS (SELECT *, ROW_NUMBER() OVER (PARTITION BY id ORDER BY updated_at DESC) as rn FROM table) SELECT * FROM ranked WHERE rn = 1","Same","Use QUALIFY instead: SELECT * FROM table QUALIFY ROW_NUMBER() OVER (PARTITION BY id ORDER BY updated_at DESC) = 1","Index on partition and order columns"
+Gap Fill Dates,"Fill missing dates in time series","Use generate_series to create date spine then LEFT JOIN","generate_series(start_date, end_date, '1 day'::interval)","GENERATE_DATE_ARRAY(start_date, end_date)","Generate date spine first, then join data"
+Cohort Definition,"Assign users to signup cohort","SELECT user_id, DATE_TRUNC('month', MIN(signup_date)) OVER (PARTITION BY user_id) as cohort FROM events","Same","DATE_TRUNC(signup_date, MONTH)","Calculate cohort once and store"
+Retention Cohort,"Calculate retention by cohort","WITH cohorts AS (...), activity AS (...) SELECT cohort, DATEDIFF(activity_month, cohort) as period, COUNT(DISTINCT user_id)","Same; use date_part('month', age(...))","DATE_DIFF(activity_date, cohort_date, MONTH)","Pre-compute user cohorts for efficiency"
+Funnel Sequential,"Ensure funnel steps happen in order","WITH step1 AS (...), step2 AS (... WHERE step2_time > step1_time) SELECT ...","Same","Same","Index on user_id and timestamp"
+Funnel Conversion,"Count users at each funnel step","SELECT 'Step1' as step, COUNT(DISTINCT user_id) UNION ALL SELECT 'Step2', COUNT(DISTINCT CASE WHEN completed_step2 THEN user_id END)","Same","Same","One pass aggregation is efficient"
+Sessionization,"Group events into sessions by gap","SUM(CASE WHEN time_since_last > 30 THEN 1 ELSE 0 END) OVER (PARTITION BY user ORDER BY timestamp) as session_id","Same","Same","30 minute gap is common default"
+Pivot Dynamic,"Pivot rows to columns dynamically","Use CASE WHEN for known values or crosstab() extension","crosstab() function from tablefunc","PIVOT operator available","Static CASE WHEN is more portable"
+Unpivot,"Convert columns to rows","Use UNION ALL for each column or UNPIVOT keyword","UNNEST with ARRAY","UNPIVOT operator","UNION ALL works everywhere"
+Self Join for Pairs,"Find related records","SELECT a.*, b.* FROM table a JOIN table b ON a.user_id = b.user_id AND a.id < b.id","Same","Same","Use a.id < b.id to avoid duplicates"
+Recursive CTE,"Hierarchical data traversal","WITH RECURSIVE cte AS (base UNION ALL recursive) SELECT * FROM cte","Same","Does not support - use CONNECT BY or flatten","Limit recursion depth for safety"
+Anti Join,"Find records NOT in another table","SELECT * FROM a WHERE NOT EXISTS (SELECT 1 FROM b WHERE a.id = b.id)","Same; also LEFT JOIN WHERE b.id IS NULL","Same","NOT EXISTS often most efficient"
+Conditional Aggregation,"Aggregate with conditions","SUM(CASE WHEN status = 'active' THEN amount ELSE 0 END)","Same; also FILTER clause: SUM(amount) FILTER (WHERE status = 'active')","COUNTIF, SUMIF available","CASE WHEN is most portable"
+Distinct Count Per Group,"Count distinct within groups","COUNT(DISTINCT user_id) OVER (PARTITION BY category)","Same","Same; also APPROX_COUNT_DISTINCT for estimates","Expensive - consider HyperLogLog"
+Median Calculation,"Find median value","PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY value)","Same","APPROX_QUANTILES(value, 100)[OFFSET(50)]","Exact median is expensive; approximate is faster"
+Mode Calculation,"Find most frequent value","SELECT value, COUNT(*) as cnt FROM table GROUP BY value ORDER BY cnt DESC LIMIT 1","Also: mode() WITHIN GROUP (ORDER BY value)","APPROX_TOP_COUNT for approximate","Order by count descending, limit 1"
+Time Bucket,"Group timestamps into buckets","DATE_TRUNC('hour', timestamp)","date_trunc('hour', ts)","TIMESTAMP_TRUNC(ts, HOUR)","Reduces granularity for aggregation"
+Date Spine Join,"Ensure all dates present","SELECT d.date, COALESCE(t.value, 0) FROM date_spine d LEFT JOIN table t ON d.date = t.date","generate_series for date spine","GENERATE_DATE_ARRAY","Create date dimension table"
+Weighted Average,"Calculate weighted average","SUM(value * weight) / NULLIF(SUM(weight), 0)","Same","Same","Handle zero weight with NULLIF"
+Compound Growth Rate,"Calculate CAGR","POWER(end_value / start_value, 1.0 / years) - 1","Same; use POWER() function","POWER() function","Need start, end, and period count"
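
To make one of these patterns concrete end to end, here is a self-contained Python check of the Deduplication row above against an in-memory SQLite database (SQLite 3.25+ is needed for window functions; the table and values are toy data):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE records (id INTEGER, updated_at TEXT, value TEXT);
INSERT INTO records VALUES
  (1, '2024-01-01', 'old'), (1, '2024-02-01', 'new'),
  (2, '2024-01-15', 'only');
""")

# The "latest record per entity" pattern from the table above
rows = conn.execute("""
WITH ranked AS (
  SELECT *, ROW_NUMBER() OVER (PARTITION BY id ORDER BY updated_at DESC) AS rn
  FROM records
)
SELECT id, updated_at, value FROM ranked WHERE rn = 1
""").fetchall()
print(rows)  # one row per id, keeping the most recent updated_at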
package/assets/shared/validation.csv
@@ -0,0 +1,21 @@
+Mistake Type,Description,Symptoms,Prevention Query,User Question
+Duplicate Inflation,Counting same record multiple times due to duplicates or join multiplication,"Total much higher than expected; sum doesn't match source","SELECT id, COUNT(*) as cnt FROM table GROUP BY id HAVING COUNT(*) > 1","Does the total of X seem reasonable compared to other reports?"
+Wrong Granularity,Aggregating at wrong level (user vs session vs event),"Numbers don't match other reports; unexpected row counts","SELECT COUNT(*), COUNT(DISTINCT user_id), COUNT(DISTINCT session_id) FROM table","Is this data one row per user, per session, or per event?"
+Missing Filter,Forgot to exclude test users, internal accounts, or invalid data,"Numbers include test data; higher than expected","SELECT COUNT(*) FROM users WHERE email LIKE '%@company.com' OR email LIKE '%test%'","Should we exclude internal/test users? Any known filters?"
+Timezone Mismatch,Comparing dates in different timezones causing misalignment,"Day totals don't match other reports; off-by-one errors","SELECT DISTINCT date_trunc('day', ts AT TIME ZONE 'UTC') vs AT TIME ZONE 'PST'","What timezone should I use for date calculations?"
+Survivorship Bias,Only analyzing users who completed journey ignoring dropoffs,"Metrics look too good; missing failed attempts","Check: are we only looking at users who converted?","Are we analyzing all users or only those who [completed action]?"
+Simpson's Paradox,Aggregate trend opposite of subgroup trends,"Conflicting conclusions; unexpected direction","Compare aggregate vs segment-level trends","Should we break this down by [segment] to check for hidden patterns?"
+Incomplete Time Period,Comparing full period to partial period,"Latest period looks lower than historical","Check if latest period has full data: WHERE date < current_date","Is the latest time period complete, or should we exclude it?"
+Wrong Join Type,Using INNER when LEFT needed or vice versa,"Missing rows; unexpected nulls; row count changes","Compare row counts before and after join","The join produced X rows from Y original rows. Does this match expectation?"
+Null Handling Errors,NULLs excluded from aggregations unexpectedly,"Lower counts than expected; divisions by zero","SELECT COUNT(*), COUNT(column), SUM(CASE WHEN column IS NULL THEN 1 END)","How should we handle missing/null values in this analysis?"
+Off-by-One Date Errors,BETWEEN includes endpoints; wrong date boundary,"One extra or missing day; period mismatch","Check: date >= start AND date < end (exclusive end)","Should the date range include or exclude the end date?"
+Metric Definition Mismatch,Using different definition than stakeholder expects,"Numbers don't match expectations; confusion","Document exact definition before starting","How does your team define [metric]? What's included/excluded?"
+Currency Unit Confusion,Mixing dollars and cents or different currencies,"Numbers off by factor of 100 or exchange rate","Check: are amounts in dollars or cents? One currency?","Are these amounts in dollars or cents? Same currency throughout?"
+Seasonality Ignored,Comparing periods with different seasonal patterns,"Invalid conclusions; unfair comparisons","Compare same period last year, not sequential periods","Should we compare to same period last year to account for seasonality?"
+Selection Bias,Analyzing non-representative sample,"Conclusions don't generalize; biased insights","Check how sample was selected; compare to population","Is this sample representative of all users, or a specific subset?"
+Correlation vs Causation,Claiming causation from correlation,"Incorrect business recommendations","Check: is there a plausible mechanism? Control for confounders?","Does X actually cause Y, or are they just correlated?"
+Cherry Picking Dates,Choosing date range that shows desired narrative,"Misleading conclusions; not reproducible","Use standard reporting periods; document why dates chosen","Why was this specific date range chosen?"
+Aggregation Level Mismatch,Comparing metrics calculated at different levels,"Apples to oranges comparison; invalid conclusions","Ensure both metrics use same denominator/level","Are both these metrics calculated the same way (same level)?"
+Data Latency Issues,Using stale data that hasn't propagated fully,"Recent periods look incomplete; inconsistent","Check data freshness: MAX(updated_at), pipeline completion","Is this data fully loaded? When was it last updated?"
+Calculation Errors,Wrong formula for complex metrics,"Metrics don't match known correct values","Validate against known correct calculation or source","Can we validate this against another source or manual calculation?"
+Presentation Bias,Chart design exaggerating or hiding patterns,"Misleading visualizations; wrong conclusions","Check: y-axis starts at zero? Scale appropriate?","Does this chart accurately represent the data without distortion?"
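
The Duplicate Inflation and Wrong Join Type rows both come down to comparing row counts around a join; a minimal pandas sketch of that sanity check (toy frames, hypothetical column names):

import pandas as pd

orders = pd.DataFrame({"order_id": [1, 2, 3], "user_id": [10, 10, 20]})
users = pd.DataFrame({"user_id": [10, 10, 20], "plan": ["a", "b", "c"]})  # user 10 duplicated

joined = orders.merge(users, on="user_id", how="left")
if len(joined) != len(orders):  # join multiplication detected: 3 rows became 5
    print(f"Warning: join changed row count {len(orders)} -> {len(joined)}; check for duplicate keys")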
@@ -0,0 +1,51 @@
|
|
|
1
|
+
Workflow Name,Step Number,Step Name,Description,Questions to Ask,Tools/Commands,Outputs,Common Mistakes
|
|
2
|
+
Exploratory Data Analysis,1,Define Objectives,Understand what insights are needed,"What business questions should this EDA answer? Who is the primary audience for these findings?","None - conversation with stakeholder","Clear list of questions to answer","Starting analysis without clear goals"
|
|
3
|
+
Exploratory Data Analysis,2,Data Profiling,Understand data structure shape and types,"How many rows do you expect? What date range should I focus on?","df.info(), df.describe(), df.isnull().sum(), df.dtypes","Data profile report with shape types and missing values","Skipping profiling and diving straight into analysis"
|
|
4
|
+
Exploratory Data Analysis,3,Univariate Analysis,Analyze individual columns distributions,"Are there any columns I should focus on specifically?","df.hist(), df.value_counts(), df.describe()","Histograms and value distributions for key columns","Not checking for outliers or unexpected values"
|
|
5
|
+
Exploratory Data Analysis,4,Bivariate Analysis,Relationships between variables,"Which relationships are most important to understand?","df.corr(), scatter plots, grouped statistics","Correlation matrix and scatter plots showing relationships","Missing important correlations by not testing all pairs"
|
|
6
|
+
Exploratory Data Analysis,5,Document Findings,Summarize insights,"What format do you prefer for the findings summary?","Markdown report generation","Summary report with key insights and recommendations","Not prioritizing findings by business impact"
|
|
7
|
+
Dashboard Creation,1,Define Audience,Who will use the dashboard and for what purpose,"Is this for executives (high-level KPIs) or analysts (detailed breakdowns)? How often will they view it?","None - conversation","Clear audience definition and use case","Building for wrong audience (too detailed for execs or too simple for analysts)"
|
|
8
|
+
Dashboard Creation,2,Identify KPIs,What metrics matter most to track,"What are your top 5-7 metrics? Do you have targets for each?","Search industry metrics database","Prioritized list of KPIs with targets","Too many metrics (7+ KPIs causes cognitive overload)"
|
|
9
|
+
Dashboard Creation,3,Data Preparation,Get data into usable format,"Which tables contain this data? What granularity (daily/weekly)?","SQL queries, pandas transformations","Clean aggregated data ready for visualization","Not validating data before visualization"
|
|
10
|
+
Dashboard Creation,4,Chart Selection,Choose appropriate visualizations,"Any chart preferences? Need to support mobile viewing?","Search chart database","Chart type selected for each KPI","Using pie charts for more than 5 categories"
|
|
11
|
+
Dashboard Creation,5,Layout Design,Arrange components following best practices,"Should I follow inverted pyramid (KPIs top trends middle details bottom)?","Dashboard layout template","Final dashboard layout","Burying key insights at bottom of page"
|
|
12
|
+
A/B Test Analysis,1,Define Hypothesis,What are we testing and what do we expect,"What is the primary metric? What is the minimum detectable effect you care about?","None - conversation","Clear null and alternative hypothesis documented","Not defining success criteria upfront"
|
|
13
|
+
A/B Test Analysis,2,Check Sample Size,Sufficient data for statistical significance,"How long has the test been running? What is baseline conversion rate?","Power analysis calculator","Required vs actual sample size comparison","Stopping test too early (peeking problem)"
|
|
14
|
+
A/B Test Analysis,3,Validation Checks,Data quality and test validity,"Were users randomly assigned? Any known issues with test setup?","SRM check, novelty effect detection","Test validity report","Ignoring Sample Ratio Mismatch (SRM)"
|
|
15
|
+
A/B Test Analysis,4,Statistical Analysis,Calculate significance and effect size,"What confidence level is required (95% or 99%)?","t-test, chi-square, confidence interval calculation","P-value, confidence interval, effect size","Not accounting for multiple comparisons"
|
|
16
|
+
A/B Test Analysis,5,Interpret Results,What does this mean for business,"Should we roll out, iterate, or abandon based on results?","Business impact calculation","Actionable recommendation with expected impact","Declaring winner without considering practical significance"
|
|
17
|
+
Cohort Analysis,1,Define Cohort,How to group users for analysis,"Should I cohort by signup date, first purchase, or another event?","None - conversation","Cohort definition documented","Using wrong cohort definition for the question"
Cohort Analysis,2,Define Metric,What to measure over time,"Should I track retention, revenue, or activity? Over what time periods?","None - conversation","Metric and time periods defined","Measuring wrong metric for the business question"
Cohort Analysis,3,Build Cohort Table,SQL for cohort pivot table,"Is there a specific date range to analyze?","SQL with window functions, pivot tables","Cohort table with periods as columns","Off-by-one errors in period calculations"
Cohort Analysis,4,Visualize,Create retention heatmap,"Any specific cohorts to highlight?","Heatmap visualization","Color-coded retention heatmap","Using colors that don't show progression clearly"
Cohort Analysis,5,Insights,Identify patterns and explain why,"Which cohorts performed best/worst?","Comparative analysis","Insights report with recommended actions","Not investigating WHY cohorts differ"
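Steps 1-3 of the cohort workflow reduce to a pivot of distinct users by signup period and activity period. The pandas sketch below assumes monthly cohorts and invented `user_id`/month columns; the explicit period-index calculation matters because off-by-one errors there are the pitfall the table calls out.

```python
# Toy cohort retention table: monthly cohorts, monthly activity.
import pandas as pd

events = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 3, 3],
    "signup_month": pd.PeriodIndex(["2024-01"] * 4 + ["2024-02"] * 2, freq="M"),
    "active_month": pd.PeriodIndex(["2024-01", "2024-02", "2024-01",
                                    "2024-03", "2024-02", "2024-03"], freq="M"),
})

# Period 0 = the signup month itself; this is where off-by-one bugs hide.
events["period"] = (events["active_month"] - events["signup_month"]).map(lambda d: d.n)

cohort = (events.groupby(["signup_month", "period"])["user_id"]
          .nunique().unstack(fill_value=0))
retention = cohort.div(cohort[0], axis=0)   # normalize by cohort size
print(retention.round(2))
```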
Funnel Analysis,1,Define Steps,What are the funnel stages in order,"What is the first step? What is the final conversion event?","None - conversation","Ordered list of funnel steps","Missing steps or having steps out of order"
Funnel Analysis,2,Count Users,How many users at each step,"What time window should I use for the funnel?","SQL to count distinct users per step","User counts at each stage","Counting sessions instead of unique users"
Funnel Analysis,3,Calculate Drop-off,Where are users leaving,"Are there any known issues at specific steps?","Conversion rate between steps","Drop-off rates between each step","Comparing non-sequential steps"
Funnel Analysis,4,Visualize,Create funnel chart,"Prefer horizontal bars or funnel shape?","Funnel or horizontal bar visualization","Funnel visualization","Not labeling percentages clearly"
Funnel Analysis,5,Recommendations,How to improve conversion,"What levers do you have to improve each step?","Analysis of biggest opportunities","Prioritized list of improvement suggestions","Focusing on small improvements instead of biggest drop-offs"
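A compact sketch of funnel steps 2-3: distinct users per stage and step-to-step conversion. The stage names and events are hypothetical; the key details are counting unique users rather than sessions, and comparing each step only to the one directly before it.

```python
# Funnel counts and drop-off on fabricated event data.
import pandas as pd

events = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2, 3, 3, 4],
    "step":    ["visit", "signup", "purchase",
                "visit", "signup", "visit", "signup", "visit"],
})

order = ["visit", "signup", "purchase"]                 # ordered stages
counts = (events.groupby("step")["user_id"].nunique()   # users, not rows
          .reindex(order, fill_value=0))

# Conversion vs the *previous* step; drop-off is its complement.
step_conv = counts / counts.shift(1)
funnel = pd.DataFrame({"users": counts,
                       "step_conversion": step_conv.round(2),
                       "drop_off": (1 - step_conv).round(2)})
print(funnel)
```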
Time Series Analysis,1,Define Metric,What to analyze over time,"Daily revenue, weekly users, or monthly orders? How far back?","None - conversation","Metric and time range defined","Wrong granularity (too granular hides trends or too aggregated misses patterns)"
Time Series Analysis,2,Aggregate,Group by time period,"Any specific date filters? Exclude weekends?","SQL with DATE_TRUNC, GROUP BY","Aggregated time series data","Timezone issues in date grouping"
Time Series Analysis,3,Decompose,"Identify trend, seasonality, and residual","Is there known seasonality (weekly/monthly/yearly)?","Seasonal decomposition, moving averages","Decomposed components visualization","Ignoring seasonality when comparing periods"
Time Series Analysis,4,Compare Periods,"YoY, MoM, and WoW comparisons","Which comparison periods matter most?","LAG functions, period-over-period calculations","Comparison table with growth rates","Comparing incomplete periods"
Time Series Analysis,5,Forecast (optional),Predict future values,"Do you need forecasting? What horizon?","Simple forecasting models","Forecast with confidence intervals","Overfitting on historical data"
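For time-series steps 2 and 4, the sketch below resamples invented daily revenue to weekly buckets and computes week-over-week growth with `pct_change`; the frequency and metric are assumptions, and the trailing bucket illustrates the incomplete-period pitfall.

```python
# Weekly aggregation plus WoW comparison on made-up daily data.
import pandas as pd

daily = pd.DataFrame(
    {"revenue": [100, 110, 95, 120, 130, 125, 140, 150, 160]},
    index=pd.date_range("2024-01-01", periods=9, freq="D"),
)

weekly = daily.resample("W").sum()
weekly["wow_growth"] = weekly["revenue"].pct_change()

# Pitfall guard: the last bucket may be a partial week -- label it
# rather than comparing it silently against complete weeks.
weekly["complete"] = weekly.index <= daily.index.max()
print(weekly)
```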
Customer Segmentation,1,Define Variables,What to segment on,"RFM, behavior, or demographics? What actions should differ by segment?","None - conversation","Segmentation variables defined","Choosing variables that don't drive different actions"
Customer Segmentation,2,Feature Engineering,Calculate segment variables,"What time window for calculating features?","SQL or Python for RFM or other features","Feature table ready for segmentation","Using raw values instead of normalized scores"
Customer Segmentation,3,Clustering,Group similar customers,"How many segments should we create?","K-means or rule-based segmentation","Cluster assignments","Too many or too few segments"
Customer Segmentation,4,Profile Segments,Describe each group characteristics,"Which metrics matter most for describing segments?","Aggregate statistics per segment","Segment profile table","Not validating segments are actionable"
Customer Segmentation,5,Actionable Names,Name the segments memorably,"Any naming conventions to follow?","Creative naming","Named segments (e.g., Champions, At Risk)","Generic names that don't inspire action"
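Segmentation steps 2, 3, and 5 can be sketched as RFM quantile scoring plus a rule-based label, a simple stand-in for the K-means option the table mentions. The score thresholds and segment names ("Champions", "At Risk") are illustrative.

```python
# RFM scoring on made-up customers; thresholds/names are assumptions.
import pandas as pd

rfm = pd.DataFrame({
    "customer_id": [1, 2, 3, 4, 5],
    "recency_days": [5, 40, 90, 15, 200],
    "frequency":    [12, 4, 2, 8, 1],
    "monetary":     [900, 300, 120, 600, 50],
})

# Quantile scores 1-5 (step 2); lower recency is better, so its labels
# run in reverse. rank() breaks ties so the qcut bins stay unique.
rfm["r"] = pd.qcut(rfm["recency_days"], 5, labels=[5, 4, 3, 2, 1]).astype(int)
rfm["f"] = pd.qcut(rfm["frequency"].rank(method="first"), 5,
                   labels=[1, 2, 3, 4, 5]).astype(int)
rfm["m"] = pd.qcut(rfm["monetary"].rank(method="first"), 5,
                   labels=[1, 2, 3, 4, 5]).astype(int)

# Steps 3 and 5: bucket total score into memorably named segments.
total = rfm[["r", "f", "m"]].sum(axis=1)
rfm["segment"] = pd.cut(total, bins=[0, 7, 11, 15],
                        labels=["At Risk", "Steady", "Champions"])
print(rfm[["customer_id", "r", "f", "m", "segment"]])
```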
Data Cleaning Pipeline,1,Profiling,Understand data quality issues,"What quality issues are you aware of?","df.isnull().sum(), df.duplicated().sum()","Data quality report","Assuming data is clean without checking"
Data Cleaning Pipeline,2,Missing Values,Handle nulls appropriately,"Can I drop rows with missing data or should I impute?","fillna(), dropna(), imputation strategies","Data with handled missing values","Using wrong imputation strategy (mean for skewed data)"
Data Cleaning Pipeline,3,Duplicates,Remove redundant rows,"What makes a row a duplicate (exact match or by key)?","drop_duplicates(), deduplication logic","Deduplicated data","Removing wrong duplicates (losing valid data)"
Data Cleaning Pipeline,4,Outliers,Handle extreme values,"Should outliers be removed, capped, or kept?","IQR, Z-score detection, capping","Data with handled outliers","Removing outliers that are valid data points"
Data Cleaning Pipeline,5,Validation,Verify clean data meets expectations,"What validation checks should pass?","Assertions, before/after comparison","Validation report confirming data quality","Not comparing before/after statistics"
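The five cleaning steps compress into one short pandas pass: profile, dedupe on a key, impute with the median (safer than the mean on skewed data, per the pitfall), cap outliers at the IQR fence, and assert the result. The `order_id`/`amount` schema and the 1.5x IQR multiplier are assumptions.

```python
# End-to-end cleaning sketch on a fabricated orders table.
import pandas as pd

df = pd.DataFrame({
    "order_id": [1, 2, 2, 3, 4],
    "amount": [100.0, None, 250.0, 250.0, 10_000.0],
})

# Step 1: profile before touching anything.
print(df.isnull().sum(), df.duplicated(subset="order_id").sum())

# Step 3: dedupe by business key, not exact row match.
df = df.drop_duplicates(subset="order_id", keep="first")

# Step 2: median imputation (a mean would be skewed by the 10k value).
df["amount"] = df["amount"].fillna(df["amount"].median())

# Step 4: cap at the upper IQR fence instead of dropping the row.
q1, q3 = df["amount"].quantile([0.25, 0.75])
df["amount"] = df["amount"].clip(upper=q3 + 1.5 * (q3 - q1))

# Step 5: validate the cleaned frame against expectations.
assert df["order_id"].is_unique and df["amount"].notna().all()
print(df)
```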
Ad-hoc Query Analysis,1,Clarify Question,What exactly do they need,"Can you give me an example of the desired output format?","None - conversation","Clear requirements documented","Assuming you understand without confirming"
Ad-hoc Query Analysis,2,Identify Tables,Where is the data located,"Which database/schema/tables contain this data? Any documentation?","Schema exploration, data dictionary","Table and column mapping","Joining wrong tables or using outdated sources"
Ad-hoc Query Analysis,3,Write Query,Draft SQL or Python code,"None - writing code","SQL or Python script","Working query with explanation","Not explaining the logic behind complex queries"
Ad-hoc Query Analysis,4,Validate,Check results make sense,"Does the output look correct? Check this sample.","Sample verification, total checks","Validated results","Delivering without sanity checking totals"
Ad-hoc Query Analysis,5,Iterate,Refine based on feedback,"Does this answer your question? Any adjustments needed?","Query modifications","Final refined query and results","Not iterating when initial results are wrong"
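Steps 3-4 of the ad-hoc workflow, sketched against an in-memory SQLite table so it runs anywhere; the `orders` schema and the reconciliation check (grouped totals must equal the raw total) are illustrative choices, not part of the workflow spec.

```python
# Ad-hoc query plus sanity check on a throwaway SQLite table.
import sqlite3
import pandas as pd

con = sqlite3.connect(":memory:")
pd.DataFrame({"region": ["EU", "US", "US", "APAC"],
              "amount": [100, 200, 300, 150]}).to_sql("orders", con, index=False)

# Step 3: the query itself, kept simple and explainable.
result = pd.read_sql(
    "SELECT region, SUM(amount) AS revenue FROM orders GROUP BY region", con)

# Step 4: validate -- grouped revenue must reconcile with the raw total.
raw_total = pd.read_sql("SELECT SUM(amount) AS total FROM orders", con)["total"].iloc[0]
assert result["revenue"].sum() == raw_total
print(result)
```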
KPI Reporting,1,Define KPIs,Which metrics to report,"What are the most important KPIs for this report? Any targets?","Search industry metrics","Selected KPIs with targets","Too many KPIs dilute focus"
KPI Reporting,2,Calculate,Compute current values,"What time period should I calculate for?","SQL for each KPI calculation","Current KPI values","Calculation errors in complex KPIs"
KPI Reporting,3,Compare,Benchmark against previous period or target,"Compare to last period, last year, or target?","YoY, MoM, vs goal calculations","Comparison table with deltas","Comparing to wrong baseline"
KPI Reporting,4,Format,Create readable report,"Prefer table, cards, or dashboard format?","Report formatting","Formatted KPI report","Poor formatting reduces readability"
KPI Reporting,5,Highlight,What needs attention,"What threshold triggers a red flag?","Conditional formatting, alerts","Highlighted issues needing action","Not drawing attention to problems"
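Finally, a sketch of KPI reporting steps 2-5: current value vs target with a red-flag threshold. The KPI names, targets, 5% tolerance, and lower-is-better handling for churn are all invented for illustration.

```python
# KPI comparison table with a simple attention flag; data is made up.
import pandas as pd

kpis = pd.DataFrame({
    "kpi":     ["MRR", "Churn %", "New Signups"],
    "current": [105_000, 6.2, 480],
    "target":  [100_000, 5.0, 500],
})

kpis["delta_vs_target"] = kpis["current"] - kpis["target"]

# For churn, lower is better, so flip the sign before flagging.
lower_is_better = kpis["kpi"].eq("Churn %")
shortfall = kpis["delta_vs_target"].where(~lower_is_better,
                                          -kpis["delta_vs_target"])

# Step 5: highlight anything more than 5% off target the wrong way.
kpis["flag"] = shortfall < -0.05 * kpis["target"]
print(kpis)
```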