PyPI - databricks-tpcds - Versions diffs - 0.1.0__tar.gz - Mend

databricks-tpcds 0.1.0__tar.gz

Files changed (121)

databricks_tpcds-0.1.0/MANIFEST.in ADDED

@@ -0,0 +1 @@
+recursive-include src/resources/queries *.sql

databricks_tpcds-0.1.0/PKG-INFO ADDED

@@ -0,0 +1,101 @@
+Metadata-Version: 2.4
+Name: databricks-tpcds
+Version: 0.1.0
+Summary: Run the TPC-DS benchmark on Databricks (Delta Lake).
+Home-page: https://github.com/onehouseinc/onebench
+Author: Onehouse
+License: Apache-2.0
+Classifier: Programming Language :: Python :: 3
+Classifier: License :: OSI Approved :: Apache Software License
+Classifier: Operating System :: OS Independent
+Requires-Python: >=3.9
+Description-Content-Type: text/markdown
+Dynamic: author
+Dynamic: classifier
+Dynamic: description
+Dynamic: description-content-type
+Dynamic: home-page
+Dynamic: license
+Dynamic: requires-python
+Dynamic: summary
+## Running TPCDS on Databricks
+This document describes how to run TPC-DS on Databricks. TPC-DS is a decision support benchmark that models several generally applicable aspects of a decision support system, including queries and data maintenance. It provides a representative evaluation of performance as a general-purpose decision support system. The benchmark is defined and maintained by the Transaction Processing Performance Council (TPC); the "DS" in its name stands for decision support.
+### Pre-requisites
+1. Databricks workspace
+2. Databricks metastore configured for the workspace
+3. Databricks cluster (jobs, all-purpose, etc.)
+## Install from PyPI
+Install the package directly in a Databricks notebook:
+```shell
+%pip install databricks-tpcds
+```
+The package provides the `DatabricksTPCDS` library. You drive it from an entrypoint script like
+the Delta Lake example below.
+## Delta Lake entrypoint example
+Fill in the placeholder `catalog_name`, `bucket_name`, `prefix`, and `schema_name` with your own
+values, then run it on your Databricks cluster.
+```python
+from pyspark.sql import SparkSession
+from databricks_tpcds.databricks_tpcds import DatabricksTPCDS
+def main():
+    catalog_name = 'my_catalog'
+    bucket_name = 'my-bucket'
+    prefix = 'path/to/tpcds-datasets/1TB'
+    schema_name = 'my_schema'
+    # Initialize Spark session
+    spark = SparkSession.builder.appName("TPCDS Query Runner").getOrCreate()
+    # Enable/disable cache
+    spark.conf.set("spark.databricks.io.cache.enabled", "false")
+    databricks_tpcds = DatabricksTPCDS(spark, schema_name=schema_name, catalog_name=catalog_name)
+    # Create catalog
+    databricks_tpcds.create_catalog()
+    # Create schema
+    databricks_tpcds.create_schema()
+    # Create a single table, provide the table name
+    # databricks_tpcds.create_table(bucket_name, prefix, "call_center")
+    # Create multiple tables, provide the list of table names
+    # databricks_tpcds.create_tables(bucket_name, prefix, ["call_center", "catalog_page"])
+    # Create all tables found under the given bucket name and prefix
+    databricks_tpcds.create_all_tables(bucket_name, prefix)
+    # Run all queries three times and print the timings (in milliseconds) as CSV
+    for _ in range(3):
+        time_taken_by_queries = databricks_tpcds.run_all_queries(should_warmup=False)
+        print("QUERY_NUMBER,TIME_TAKEN")
+        for query_no, time_taken in time_taken_by_queries.items():
+            print(f"{query_no},{time_taken}")
+if __name__ == "__main__":
+    main()
+```
+## Developing locally
+1. Modify the code in `src/databricks_tpcds/databricks_tpcds.py` if necessary
+2. Review or modify the queries in `src/resources/queries/`
+3. Build the package:
+```shell
+cd tpcds/databricks
+python3.10 -m build
+```
+4. Upload the built `.whl` to your Databricks workspace and install it in a notebook:
+```shell
+%pip install path/to/databricks_tpcds-0.1.0-py3-none-any.whl --force-reinstall
+```
+5. Run the benchmark using the Delta Lake entrypoint example above.

databricks_tpcds-0.1.0/README.md ADDED

@@ -0,0 +1,80 @@
+## Running TPCDS on Databricks
+This document describes how to run TPC-DS on Databricks. TPC-DS is a decision support benchmark that models several generally applicable aspects of a decision support system, including queries and data maintenance. It provides a representative evaluation of performance as a general-purpose decision support system. The benchmark is defined and maintained by the Transaction Processing Performance Council (TPC); the "DS" in its name stands for decision support.
+### Pre-requisites
+1. Databricks workspace
+2. Databricks metastore configured for the workspace
+3. Databricks cluster (jobs, all-purpose, etc.)
+## Install from PyPI
+Install the package directly in a Databricks notebook:
+```shell
+%pip install databricks-tpcds
+```
+The package provides the `DatabricksTPCDS` library. You drive it from an entrypoint script like
+the Delta Lake example below.
+## Delta Lake entrypoint example
+Fill in the placeholder `catalog_name`, `bucket_name`, `prefix`, and `schema_name` with your own
+values, then run it on your Databricks cluster.
+```python
+from pyspark.sql import SparkSession
+from databricks_tpcds.databricks_tpcds import DatabricksTPCDS
+def main():
+    catalog_name = 'my_catalog'
+    bucket_name = 'my-bucket'
+    prefix = 'path/to/tpcds-datasets/1TB'
+    schema_name = 'my_schema'
+    # Initialize Spark session
+    spark = SparkSession.builder.appName("TPCDS Query Runner").getOrCreate()
+    # Enable/disable cache
+    spark.conf.set("spark.databricks.io.cache.enabled", "false")
+    databricks_tpcds = DatabricksTPCDS(spark, schema_name=schema_name, catalog_name=catalog_name)
+    # Create catalog
+    databricks_tpcds.create_catalog()
+    # Create schema
+    databricks_tpcds.create_schema()
+    # Create a single table, provide the table name
+    # databricks_tpcds.create_table(bucket_name, prefix, "call_center")
+    # Create multiple tables, provide the list of table names
+    # databricks_tpcds.create_tables(bucket_name, prefix, ["call_center", "catalog_page"])
+    # Create all tables found under the given bucket name and prefix
+    databricks_tpcds.create_all_tables(bucket_name, prefix)
+    # Run all queries three times and print the timings (in milliseconds) as CSV
+    for _ in range(3):
+        time_taken_by_queries = databricks_tpcds.run_all_queries(should_warmup=False)
+        print("QUERY_NUMBER,TIME_TAKEN")
+        for query_no, time_taken in time_taken_by_queries.items():
+            print(f"{query_no},{time_taken}")
+if __name__ == "__main__":
+    main()
+```
+## Developing locally
+1. Modify the code in `src/databricks_tpcds/databricks_tpcds.py` if necessary
+2. Review or modify the queries in `src/resources/queries/`
+3. Build the package:
+```shell
+cd tpcds/databricks
+python3.10 -m build
+```
+4. Upload the built `.whl` to your Databricks workspace and install it in a notebook:
+```shell
+%pip install path/to/databricks_tpcds-0.1.0-py3-none-any.whl --force-reinstall
+```
+5. Run the benchmark using the Delta Lake entrypoint example above.
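
The entrypoint example in this README runs the full query set three times and prints one CSV block per run. A minimal sketch of how those per-run dictionaries could be summarised afterwards is shown here. The helper name `median_times` and the sample timings are hypothetical and not part of the package; the only behaviour taken from the package source is that `run_query` returns `-1` when a query file is missing, which the sketch filters out.

```python
from statistics import median


def median_times(runs: list[dict]) -> dict:
    """Median time per query across runs, ignoring the -1 'file not found' sentinel."""
    collected: dict = {}
    for run in runs:
        for query_no, time_taken in run.items():
            if time_taken >= 0:
                collected.setdefault(query_no, []).append(time_taken)
    return {query_no: median(times) for query_no, times in collected.items()}


# Made-up timings in milliseconds, shaped like the dicts returned by run_all_queries().
runs = [
    {1: 1200, "14a": 5400, 2: -1},
    {1: 1100, "14a": 5000, 2: -1},
    {1: 1300, "14a": 5200, 2: -1},
]

summary = median_times(runs)
print("QUERY_NUMBER,MEDIAN_TIME_MS")
for query_no, time_taken in summary.items():
    print(f"{query_no},{time_taken}")
```

Using the median rather than the mean is a design choice for this sketch: it keeps a single slow first run from dominating the summary.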

databricks_tpcds-0.1.0/setup.cfg ADDED

@@ -0,0 +1,4 @@
+[egg_info]
+tag_build =
+tag_date = 0

databricks_tpcds-0.1.0/setup.py ADDED

@@ -0,0 +1,31 @@
+import os
+from setuptools import setup, find_packages
+this_dir = os.path.abspath(os.path.dirname(__file__))
+with open(os.path.join(this_dir, "README.md"), encoding="utf-8") as f:
+    long_description = f.read()
+setup(
+    name='databricks-tpcds',
+    version='0.1.0',
+    description='Run the TPC-DS benchmark on Databricks (Delta Lake).',
+    long_description=long_description,
+    long_description_content_type='text/markdown',
+    author='Onehouse',
+    url='https://github.com/onehouseinc/onebench',
+    license='Apache-2.0',
+    python_requires='>=3.9',
+    packages=find_packages(where='src'),
+    package_dir={'': 'src'},
+    install_requires=[],
+    include_package_data=True,
+    package_data={
+        "resources": ["queries/*.sql"]  # Ensure SQL query files are included
+    },
+    classifiers=[
+        "Programming Language :: Python :: 3",
+        "License :: OSI Approved :: Apache Software License",
+        "Operating System :: OS Independent",
+    ],
+)
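
The `package_data` entry in this `setup.py` ships the `.sql` files inside the top-level `resources` package, and the library module reads them back through `importlib.resources`. The following is a minimal sketch of that read path, not the package's own code: it uses `importlib.resources.files` (available from Python 3.9) instead of `open_text`, and it builds a throwaway package named `demo_queries` (a hypothetical name) so that it runs without `databricks-tpcds` installed.

```python
import importlib
import importlib.resources
import sys
import tempfile
from pathlib import Path


def read_packaged_query(package: str, filename: str) -> str:
    """Return the text of a resource file shipped inside an importable package."""
    resource = importlib.resources.files(package).joinpath(filename)
    return resource.read_text(encoding="utf-8")


# Build a temporary package on disk that mimics a package with bundled .sql files.
with tempfile.TemporaryDirectory() as tmp:
    pkg_dir = Path(tmp) / "demo_queries"
    pkg_dir.mkdir()
    (pkg_dir / "__init__.py").write_text("", encoding="utf-8")
    (pkg_dir / "q0.sql").write_text("select * from call_center limit 10;\n", encoding="utf-8")

    sys.path.insert(0, tmp)
    try:
        importlib.invalidate_caches()
        demo_sql = read_packaged_query("demo_queries", "q0.sql")
    finally:
        # Undo the path and module-cache changes made for the demonstration.
        sys.path.remove(tmp)
        sys.modules.pop("demo_queries", None)

print(demo_sql.strip())
```

Against an installed copy of this package, the equivalent call would presumably be `read_packaged_query("resources.queries", "q1.sql")`; that has not been verified here.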

databricks_tpcds-0.1.0/src/databricks_tpcds/__init__.py ADDED

@@ -0,0 +1,9 @@
+import os
+import sys
+# Ensure that the package can find resources
+sys.path.append(os.path.dirname(os.path.abspath(__file__)))
+# Expose modules so they can be imported directly
+from resources import table_names
+from .databricks_tpcds import DatabricksTPCDS

databricks_tpcds-0.1.0/src/databricks_tpcds/databricks_tpcds.py ADDED

@@ -0,0 +1,153 @@
+import time
+from resources import table_names
+import importlib.resources as pkg_resources
+class DatabricksTPCDS:
+    """
+    A class to interact with Databricks for TPC-DS benchmarking.
+    Attributes
+    ----------
+    spark : pyspark.sql.session.SparkSession
+        The SparkSession used to interact with Databricks.
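+    catalog_name : str
+        The name of the catalog in Databricks (optional).
+    schema_name : str
+        The name of the schema that holds the TPC-DS tables.
+    time_taken_by_query : dict
+        A dictionary to store the unrounded time taken by each query, in milliseconds.
+    Methods
+    -------
+    use_catalog() -> None:
+        Switches to the configured catalog, if one was given.
+    use_schema() -> None:
+        Switches to the configured schema.
+    create_catalog() -> None:
+        Creates the catalog in Databricks if it does not exist.
+    create_schema() -> None:
+        Creates the schema in Databricks if it does not exist.
+    create_table(bucket_name: str, prefix: str, table_name: str, table_format: str = "delta") -> None:
+        Creates one table over the data at s3://<bucket_name>/<prefix>/<table_name>.
+    create_tables(bucket_name: str, prefix: str, table_names: list[str], table_format: str = "delta") -> None:
+        Creates multiple tables in the Databricks catalog.
+    create_all_tables(bucket_name: str, prefix: str, table_format: str = "delta") -> None:
+        Creates every table listed in table_names.TABLE_NAMES.
+    warm_up() -> None:
+        Reads 100 ordered rows from every table.
+    run_query(query_num: str) -> int:
+        Runs one TPC-DS query and returns its execution time in milliseconds (-1 if the query file is missing).
+    run_queries(query_nums: list[int]) -> dict[int, int]:
+        Runs the given TPC-DS queries and returns a dictionary of execution times in milliseconds.
+    run_all_queries(should_warmup: bool = False) -> dict:
+        Runs queries 1-99 (14, 23, 24 and 39 as "a" and "b" variants) and returns a dictionary of execution times in milliseconds.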
+    """
+    order_by_cols: dict[str, str] = {
+        "call_center": "cc_call_center_id",
+        "catalog_page": "cp_catalog_page_id",
+        "catalog_returns": "cr_returned_date_sk",
+        "catalog_sales": "cs_sold_date_sk",
+        "customer": "c_customer_id",
+        "customer_address": "ca_address_id",
+        "customer_demographics": "cd_demo_sk",
+        "date_dim": "d_date_id",
+        "household_demographics": "hd_demo_sk",
+        "income_band": "ib_income_band_sk",
+        "inventory": "inv_item_sk",
+        "item": "i_item_id",
+        "promotion": "p_promo_id",
+        "reason": "r_reason_id",
+        "ship_mode": "sm_ship_mode_id",
+        "store": "s_store_id",
+        "store_returns": "sr_returned_date_sk",
+        "store_sales": "ss_sold_date_sk",
+        "time_dim": "t_time_id",
+        "warehouse": "w_warehouse_id",
+        "web_page": "wp_web_page_id",
+        "web_returns": "wr_returned_date_sk",
+        "web_sales": "ws_sold_date_sk",
+        "web_site": "web_site_id"
+    }
+    def __init__(self, spark, schema_name, catalog_name=None):
+        self.spark = spark
+        self.catalog_name = catalog_name
+        self.schema_name = schema_name
+        # self.enable_cache = enable_cache
+        self.time_taken_by_query = {}
+        print(f"Disk cache enabled: {self.spark.conf.get('spark.databricks.io.cache.enabled')}")
+    # def enable_disk_cache(self) -> None:
+    #     if self.enable_cache == True:
+    #         self.spark.conf.set("spark.databricks.io.cache.enabled", "true")
+    #     elif self.enable_cache == False:
+    #         self.spark.conf.set("spark.databricks.io.cache.enabled", "false")
+    def use_catalog(self) -> None:
+        if self.catalog_name:
+            self.spark.sql(f"USE CATALOG {self.catalog_name}")
+    def use_schema(self) -> None:
+        self.spark.sql(f"USE SCHEMA {self.schema_name}")
+    def create_catalog(self) -> None:
+        if self.catalog_name:
+            self.spark.sql(f"CREATE CATALOG IF NOT EXISTS {self.catalog_name}")
+    def create_schema(self) -> None:
+        self.use_catalog()
+        self.spark.sql(f"CREATE SCHEMA IF NOT EXISTS {self.schema_name}")
+    def create_table(self, bucket_name: str, prefix: str, table_name: str, table_format: str = "delta") -> None:
+        table_path = f"s3://{bucket_name}/{prefix}/{table_name}"
+        self.use_catalog()
+        self.use_schema()
+        self.spark.sql(f"CREATE TABLE IF NOT EXISTS {table_name} USING {table_format} LOCATION '{table_path}'")
+        print(f"Table {table_name} created successfully using {table_format} format.")
+    def create_tables(self, bucket_name: str, prefix: str, table_names: list[str], table_format: str = "delta") -> None:
+        for table in table_names:
+            self.create_table(bucket_name, prefix, table, table_format)
+    def create_all_tables(self, bucket_name: str, prefix: str, table_format: str = "delta") -> None:
+        self.create_tables(bucket_name, prefix, table_names.TABLE_NAMES, table_format)
+    def warm_up(self) -> None:
+        self.use_catalog()
+        self.use_schema()
+        for table in table_names.TABLE_NAMES:
+            order_by_col = self.order_by_cols[table]
+            self.spark.sql(f"SELECT * FROM {table} ORDER BY {order_by_col} LIMIT 100").collect()
+    def run_query(self, query_num: str) -> int:
+        # Returns the elapsed time in whole milliseconds, or -1 if the query file is missing.
+        self.use_catalog()
+        self.use_schema()
+        query_filename = f"q{query_num}.sql"
+        try:
+            with pkg_resources.open_text("resources.queries", query_filename) as query:
+                query_desc = f"q{query_num}"
+                print(query_desc)
+                query_string = query.read()
+                start_time = time.time()
+                # Tag the Spark jobs with the query name so they can be identified and cancelled.
+                self.spark.sparkContext.setJobGroup(query_desc, query_desc, interruptOnCancel=True)
+                # collect() forces the query to run to completion before the timer stops.
+                self.spark.sql(query_string).collect()
+                end_time = time.time()
+                # Convert seconds to milliseconds; the unrounded value is kept on the instance.
+                time_taken = (end_time - start_time) * 1000
+                self.time_taken_by_query[query_num] = time_taken
+                return round(time_taken)
+        except FileNotFoundError:
+            print(f"Query file {query_filename} not found in package.")
+            return -1
+    def run_queries(self, query_nums: list[int]) -> dict[int, int]:
+        time_taken_by_queries = {}
+        for query_num in query_nums:
+            time_taken_by_queries[query_num] = self.run_query(str(query_num))
+        return time_taken_by_queries
+    def run_all_queries(self, should_warmup: bool = False) -> dict:
+        # Keys are ints for single queries and strings such as "14a" for split queries.
+        if should_warmup:
+            self.warm_up()
+        time_taken_by_queries = {}
+        for query_num in range(1, 100):
+            if query_num in [14, 23, 24, 39]:
+                # These queries are run as two variants, "a" and "b".
+                for variant in ["a", "b"]:
+                    query_id = f"{query_num}{variant}"
+                    time_taken_by_queries[query_id] = self.run_query(query_id)
+            else:
+                time_taken_by_queries[query_num] = self.run_query(str(query_num))
+        return time_taken_by_queries
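
The query ordering and timing logic in this module can be sketched in isolation, without a Spark session. The following is an illustrative sketch only, and it differs from the module in two deliberate ways: ids are returned uniformly as strings, and elapsed time is measured with `time.perf_counter` rather than `time.time`. The names `SPLIT_QUERIES`, `tpcds_query_ids`, and `time_call_ms` are hypothetical.

```python
import time
from typing import Callable

# Query numbers that run_all_queries() executes as "a" and "b" variants.
SPLIT_QUERIES = {14, 23, 24, 39}


def tpcds_query_ids() -> list[str]:
    """Return query ids in run order: "1" to "99", with split queries expanded."""
    ids: list[str] = []
    for num in range(1, 100):
        if num in SPLIT_QUERIES:
            ids.extend(f"{num}{variant}" for variant in ("a", "b"))
        else:
            ids.append(str(num))
    return ids


def time_call_ms(run: Callable[[], object]) -> int:
    """Run a zero-argument callable and return the elapsed time in whole milliseconds."""
    start = time.perf_counter()
    run()
    return round((time.perf_counter() - start) * 1000)


query_ids = tpcds_query_ids()
print(len(query_ids))  # 95 single queries + 4 split queries * 2 variants = 103

# Stand-in workload; in the real module this would be spark.sql(...).collect().
elapsed = time_call_ms(lambda: sum(range(1000)))
print(f"elapsed: {elapsed} ms")
```

`perf_counter` is a monotonic clock, so the measurement is not affected by system clock adjustments during a long benchmark run.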

databricks_tpcds-0.1.0/src/databricks_tpcds.egg-info/PKG-INFO ADDED

@@ -0,0 +1,101 @@
+Metadata-Version: 2.4
+Name: databricks-tpcds
+Version: 0.1.0
+Summary: Run the TPC-DS benchmark on Databricks (Delta Lake).
+Home-page: https://github.com/onehouseinc/onebench
+Author: Onehouse
+License: Apache-2.0
+Classifier: Programming Language :: Python :: 3
+Classifier: License :: OSI Approved :: Apache Software License
+Classifier: Operating System :: OS Independent
+Requires-Python: >=3.9
+Description-Content-Type: text/markdown
+Dynamic: author
+Dynamic: classifier
+Dynamic: description
+Dynamic: description-content-type
+Dynamic: home-page
+Dynamic: license
+Dynamic: requires-python
+Dynamic: summary
+## Running TPCDS on Databricks
+This document describes how to run TPC-DS on Databricks. TPC-DS is a decision support benchmark that models several generally applicable aspects of a decision support system, including queries and data maintenance. It provides a representative evaluation of performance as a general-purpose decision support system. The benchmark is defined and maintained by the Transaction Processing Performance Council (TPC); the "DS" in its name stands for decision support.
+### Pre-requisites
+1. Databricks workspace
+2. Databricks metastore configured for the workspace
+3. Databricks cluster (jobs, all-purpose, etc.)
+## Install from PyPI
+Install the package directly in a Databricks notebook:
+```shell
+%pip install databricks-tpcds
+```
+The package provides the `DatabricksTPCDS` library. You drive it from an entrypoint script like
+the Delta Lake example below.
+## Delta Lake entrypoint example
+Fill in the placeholder `catalog_name`, `bucket_name`, `prefix`, and `schema_name` with your own
+values, then run it on your Databricks cluster.
+```python
+from pyspark.sql import SparkSession
+from databricks_tpcds.databricks_tpcds import DatabricksTPCDS
+def main():
+    catalog_name = 'my_catalog'
+    bucket_name = 'my-bucket'
+    prefix = 'path/to/tpcds-datasets/1TB'
+    schema_name = 'my_schema'
+    # Initialize Spark session
+    spark = SparkSession.builder.appName("TPCDS Query Runner").getOrCreate()
+    # Enable/disable cache
+    spark.conf.set("spark.databricks.io.cache.enabled", "false")
+    databricks_tpcds = DatabricksTPCDS(spark, schema_name=schema_name, catalog_name=catalog_name)
+    # Create catalog
+    databricks_tpcds.create_catalog()
+    # Create schema
+    databricks_tpcds.create_schema()
+    # Create a single table, provide the table name
+    # databricks_tpcds.create_table(bucket_name, prefix, "call_center")
+    # Create multiple tables, provide the list of table names
+    # databricks_tpcds.create_tables(bucket_name, prefix, ["call_center", "catalog_page"])
+    # Create all tables found under the given bucket name and prefix
+    databricks_tpcds.create_all_tables(bucket_name, prefix)
+    # Run all queries three times and print the timings (in milliseconds) as CSV
+    for _ in range(3):
+        time_taken_by_queries = databricks_tpcds.run_all_queries(should_warmup=False)
+        print("QUERY_NUMBER,TIME_TAKEN")
+        for query_no, time_taken in time_taken_by_queries.items():
+            print(f"{query_no},{time_taken}")
+if __name__ == "__main__":
+    main()
+```
+## Developing locally
+1. Modify the code in `src/databricks_tpcds/databricks_tpcds.py` if necessary
+2. Review or modify the queries in `src/resources/queries/`
+3. Build the package:
+```shell
+cd tpcds/databricks
+python3.10 -m build
+```
+4. Upload the built `.whl` to your Databricks workspace and install it in a notebook:
+```shell
+%pip install path/to/databricks_tpcds-0.1.0-py3-none-any.whl --force-reinstall
+```
+5. Run the benchmark using the Delta Lake entrypoint example above.

databricks_tpcds-0.1.0/src/databricks_tpcds.egg-info/SOURCES.txt ADDED

@@ -0,0 +1,119 @@
+MANIFEST.in
+README.md
+setup.py
+src/databricks_tpcds/__init__.py
+src/databricks_tpcds/databricks_tpcds.py
+src/databricks_tpcds.egg-info/PKG-INFO
+src/databricks_tpcds.egg-info/SOURCES.txt
+src/databricks_tpcds.egg-info/dependency_links.txt
+src/databricks_tpcds.egg-info/top_level.txt
+src/resources/__init__.py
+src/resources/table_names.py
+src/resources/queries/q0.sql
+src/resources/queries/q1.sql
+src/resources/queries/q10.sql
+src/resources/queries/q11.sql
+src/resources/queries/q12.sql
+src/resources/queries/q13.sql
+src/resources/queries/q14.sql
+src/resources/queries/q14a.sql
+src/resources/queries/q14b.sql
+src/resources/queries/q15.sql
+src/resources/queries/q16.sql
+src/resources/queries/q17.sql
+src/resources/queries/q18.sql
+src/resources/queries/q19.sql
+src/resources/queries/q2.sql
+src/resources/queries/q20.sql
+src/resources/queries/q21.sql
+src/resources/queries/q22.sql
+src/resources/queries/q23.sql
+src/resources/queries/q23a.sql
+src/resources/queries/q23b.sql
+src/resources/queries/q24.sql
+src/resources/queries/q24a.sql
+src/resources/queries/q24b.sql
+src/resources/queries/q25.sql
+src/resources/queries/q26.sql
+src/resources/queries/q27.sql
+src/resources/queries/q28.sql
+src/resources/queries/q29.sql
+src/resources/queries/q3.sql
+src/resources/queries/q30.sql
+src/resources/queries/q31.sql
+src/resources/queries/q32.sql
+src/resources/queries/q33.sql
+src/resources/queries/q34.sql
+src/resources/queries/q35.sql
+src/resources/queries/q36.sql
+src/resources/queries/q37.sql
+src/resources/queries/q38.sql
+src/resources/queries/q39.sql
+src/resources/queries/q39a.sql
+src/resources/queries/q39b.sql
+src/resources/queries/q4.sql
+src/resources/queries/q40.sql
+src/resources/queries/q41.sql
+src/resources/queries/q42.sql
+src/resources/queries/q43.sql
+src/resources/queries/q44.sql
+src/resources/queries/q45.sql
+src/resources/queries/q46.sql
+src/resources/queries/q47.sql
+src/resources/queries/q48.sql
+src/resources/queries/q49.sql
+src/resources/queries/q5.sql
+src/resources/queries/q50.sql
+src/resources/queries/q51.sql
+src/resources/queries/q52.sql
+src/resources/queries/q53.sql
+src/resources/queries/q54.sql
+src/resources/queries/q55.sql
+src/resources/queries/q56.sql
+src/resources/queries/q57.sql
+src/resources/queries/q58.sql
+src/resources/queries/q59.sql
+src/resources/queries/q6.sql
+src/resources/queries/q60.sql
+src/resources/queries/q61.sql
+src/resources/queries/q62.sql
+src/resources/queries/q63.sql
+src/resources/queries/q64.sql
+src/resources/queries/q65.sql
+src/resources/queries/q66.sql
+src/resources/queries/q67.sql
+src/resources/queries/q68.sql
+src/resources/queries/q69.sql
+src/resources/queries/q7.sql
+src/resources/queries/q70.sql
+src/resources/queries/q71.sql
+src/resources/queries/q72.sql
+src/resources/queries/q73.sql
+src/resources/queries/q74.sql
+src/resources/queries/q75.sql
+src/resources/queries/q76.sql
+src/resources/queries/q77.sql
+src/resources/queries/q78.sql
+src/resources/queries/q79.sql
+src/resources/queries/q8.sql
+src/resources/queries/q80.sql
+src/resources/queries/q81.sql
+src/resources/queries/q82.sql
+src/resources/queries/q83.sql
+src/resources/queries/q84.sql
+src/resources/queries/q85.sql
+src/resources/queries/q86.sql
+src/resources/queries/q87.sql
+src/resources/queries/q88.sql
+src/resources/queries/q89.sql
+src/resources/queries/q9.sql
+src/resources/queries/q90.sql
+src/resources/queries/q91.sql
+src/resources/queries/q92.sql
+src/resources/queries/q93.sql
+src/resources/queries/q94.sql
+src/resources/queries/q95.sql
+src/resources/queries/q96.sql
+src/resources/queries/q97.sql
+src/resources/queries/q98.sql
+src/resources/queries/q99.sql

databricks_tpcds-0.1.0/src/databricks_tpcds.egg-info/dependency_links.txt ADDED

@@ -0,0 +1 @@
+

databricks_tpcds-0.1.0/src/databricks_tpcds.egg-info/top_level.txt ADDED

@@ -0,0 +1,2 @@
+databricks_tpcds
+resources

databricks_tpcds-0.1.0/src/resources/__init__.py ADDED

File without changes

databricks_tpcds-0.1.0/src/resources/queries/q0.sql ADDED

@@ -0,0 +1,2 @@
+--TPC-DS Q0
+select * from call_center limit 10;

databricks_tpcds-0.1.0/src/resources/queries/q1.sql ADDED

@@ -0,0 +1,24 @@
+--TPC-DS Q1
+with customer_total_return as
+(select sr_customer_sk as ctr_customer_sk
+,sr_store_sk as ctr_store_sk
+,sum(SR_RETURN_AMT_INC_TAX) as ctr_total_return
+from store_returns
+,date_dim
+where sr_returned_date_sk = d_date_sk
+and d_year =1999
+group by sr_customer_sk
+,sr_store_sk)
+ select  c_customer_id
+from customer_total_return ctr1
+,store
+,customer
+where ctr1.ctr_total_return > (select avg(ctr_total_return)*1.2
+from customer_total_return ctr2
+where ctr1.ctr_store_sk = ctr2.ctr_store_sk)
+and s_store_sk = ctr1.ctr_store_sk
+and s_state = 'TN'
+and ctr1.ctr_customer_sk = c_customer_sk
+order by c_customer_id
+limit 100;