noshot 0.4.1__py3-none-any.whl → 1.0.0__py3-none-any.whl

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (29)
  1. noshot/data/ML TS XAI/TS/10. Seasonal ARIMA Forecasting.ipynb +32 -714
  2. noshot/data/ML TS XAI/TS/11. Multivariate ARIMA Forecasting.ipynb +29 -1071
  3. noshot/data/ML TS XAI/TS/6. ACF PACF.ipynb +7 -105
  4. noshot/data/ML TS XAI/TS/7. Differencing.ipynb +16 -152
  5. noshot/data/ML TS XAI/TS/8. ARMA Forecasting.ipynb +26 -575
  6. noshot/data/ML TS XAI/TS/9. ARIMA Forecasting.ipynb +23 -382
  7. noshot/data/ML TS XAI/XAI/XAI 1/EDA2_chipsdatset.ipynb +633 -0
  8. noshot/data/ML TS XAI/XAI/XAI 1/EDA_IRISH_8thjan.ipynb +326 -0
  9. noshot/data/ML TS XAI/XAI/XAI 1/XAI_EX1 MODEL BIAS (FINAL).ipynb +487 -0
  10. noshot/data/ML TS XAI/XAI/XAI 1/complete_guide_to_eda_on_text_data.ipynb +845 -0
  11. noshot/data/ML TS XAI/XAI/XAI 1/deepchecksframeworks.ipynb +100 -0
  12. noshot/data/ML TS XAI/XAI/XAI 1/deepexplainers (mnist).ipynb +90 -0
  13. noshot/data/ML TS XAI/XAI/XAI 1/guidedbackpropagation.ipynb +203 -0
  14. noshot/data/ML TS XAI/XAI/XAI 1/updated_image_EDA1_with_LRP.ipynb +3998 -0
  15. noshot/data/ML TS XAI/XAI/XAI 1/zebrastripes.ipynb +271 -0
  16. noshot/data/ML TS XAI/XAI/XAI 2/EXP_5.ipynb +1545 -0
  17. noshot/data/ML TS XAI/XAI/XAI 2/Exp-3 (EDA-loan).ipynb +221 -0
  18. noshot/data/ML TS XAI/XAI/XAI 2/Exp-3 (EDA-movie).ipynb +229 -0
  19. noshot/data/ML TS XAI/XAI/XAI 2/Exp-4(Flower dataset).ipynb +237 -0
  20. noshot/data/ML TS XAI/XAI/XAI 2/Exp-4.ipynb +241 -0
  21. noshot/data/ML TS XAI/XAI/XAI 2/Exp_2.ipynb +352 -0
  22. noshot/data/ML TS XAI/XAI/XAI 2/Exp_7.ipynb +110 -0
  23. noshot/data/ML TS XAI/XAI/XAI 2/FeatureImportance_SensitivityAnalysis.ipynb +708 -0
  24. {noshot-0.4.1.dist-info → noshot-1.0.0.dist-info}/METADATA +1 -1
  25. noshot-1.0.0.dist-info/RECORD +32 -0
  26. noshot-0.4.1.dist-info/RECORD +0 -15
  27. {noshot-0.4.1.dist-info → noshot-1.0.0.dist-info}/WHEEL +0 -0
  28. {noshot-0.4.1.dist-info → noshot-1.0.0.dist-info}/licenses/LICENSE.txt +0 -0
  29. {noshot-0.4.1.dist-info → noshot-1.0.0.dist-info}/top_level.txt +0 -0
@@ -0,0 +1,845 @@
1
+ {
2
+ "cells": [
3
+ {
4
+ "cell_type": "markdown",
5
+ "metadata": {
6
+ "id": "xXn0bq0L7qmN"
7
+ },
8
+ "source": [
9
+ "## Setup"
10
+ ]
11
+ },
12
+ {
13
+ "cell_type": "code",
14
+ "execution_count": null,
15
+ "metadata": {
16
+ "id": "JB9sNbP87qmO"
17
+ },
18
+ "outputs": [],
19
+ "source": [
20
+ "import numpy as np\n",
21
+ "import pandas as pd\n",
22
+ "import matplotlib.pyplot as plt\n",
23
+ "import seaborn as sns\n",
24
+ "import string\n",
25
+ "import re\n",
26
+ "import nltk\n",
27
+ "\n",
28
+ "from tqdm import trange\n",
29
+ "from nltk import tokenize\n",
30
+ "from nltk.corpus import stopwords\n",
31
+ "from nltk.stem import WordNetLemmatizer\n",
32
+ "from nltk.probability import FreqDist\n",
33
+ "from collections import Counter\n",
34
+ "from sklearn.feature_extraction.text import CountVectorizer"
35
+ ]
36
+ },
37
+ {
38
+ "cell_type": "markdown",
39
+ "metadata": {
40
+ "id": "5UfIxL3eXIaq"
41
+ },
42
+ "source": [
43
+ "\n",
44
+ "numpy: Useful for numerical operations and array manipulations.\n",
45
+ "pandas: Ideal for data manipulation and analysis using DataFrames.\n",
46
+ "Libraries for Visualization:\n",
47
+ "matplotlib.pyplot: Provides plotting capabilities for creating static, interactive, and animated visualizations.\n",
48
+ "seaborn: Enhances matplotlib by providing a high-level interface for drawing attractive statistical graphics.\n",
49
+ "Libraries for Text Processing:\n",
50
+ "string: Provides constants and classes for string operations.\n",
51
+ "re: Supports regular expression operations for pattern matching and text processing.\n",
52
+ "nltk (Natural Language Toolkit): A suite of libraries for natural language processing. Specific modules used here include:\n",
53
+ "nltk.tokenize: For splitting text into words or sentences.\n",
54
+ "nltk.corpus.stopwords: Provides a list of common stopwords in various languages.\n",
55
+ "nltk.stem.WordNetLemmatizer: For reducing words to their base or root form.\n",
56
+ "nltk.probability.FreqDist: Computes the frequency distribution of words or events.\n",
57
+ "Utility Libraries:\n",
58
+ "tqdm.trange: Adds a progress bar to loops, providing feedback on execution progress.\n",
59
+ "Data Structures and Algorithms:\n",
60
+ "collections.Counter: Counts occurrences of elements in an iterable, useful for frequency analysis.\n",
61
+ "Feature Extraction:\n",
62
+ "sklearn.feature_extraction.text.CountVectorizer: Converts a collection of text documents to a matrix of token counts."
63
+ ]
64
+ },
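Several of the nltk modules above depend on corpora that must be downloaded once per environment. A minimal consolidated sketch of those one-time downloads (the resource names are inferred from the calls made later in this notebook):

```python
import nltk

# one-time downloads used later in the notebook
nltk.download('punkt_tab', quiet=True)  # sentence/word tokenizers (older NLTK versions use 'punkt')
nltk.download('stopwords', quiet=True)  # stopword lists for stopword removal
nltk.download('omw-1.4', quiet=True)    # multilingual wordnet data used by the lemmatizer
nltk.download('wordnet', quiet=True)    # base corpus for WordNetLemmatizer
```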
72
+ {
73
+ "cell_type": "code",
74
+ "execution_count": null,
75
+ "metadata": {
76
+ "id": "gbCBi_3a7qmP"
77
+ },
78
+ "outputs": [],
79
+ "source": [
80
+ "import warnings\n",
81
+ "warnings.filterwarnings('ignore') #Suppresses warning messages,\n",
82
+ "nltk.download('omw-1.4', quiet=True)\n",
83
+ "sns.set_style('darkgrid')\n",
84
+ "plt.rcParams['figure.figsize'] = (17,7) #Sets global parameters for Matplotlib plots, runtime configuration parameters\n",
85
+ "plt.rcParams['font.size'] = 18"
86
+ ]
87
+ },
95
+ {
96
+ "cell_type": "markdown",
97
+ "metadata": {
98
+ "id": "1OtCQoeg7qmP"
99
+ },
100
+ "source": [
101
+ "## Loading the Data"
102
+ ]
103
+ },
104
+ {
105
+ "cell_type": "code",
106
+ "execution_count": null,
107
+ "metadata": {
108
+ "_kg_hide-input": false,
109
+ "_kg_hide-output": true,
110
+ "colab": {
111
+ "base_uri": "https://localhost:8080/",
112
+ "height": 363
113
+ },
114
+ "id": "fQUgnEjX7qmQ",
115
+ "outputId": "3f3993d0-3f0b-4636-ad3a-d8222f9554e0"
116
+ },
117
+ "outputs": [],
118
+ "source": [
119
+ "data = pd.read_csv(\"tripadvisor_hotel_reviews.csv\")\n",
120
+ "data.head(10)"
121
+ ]
122
+ },
123
+ {
124
+ "cell_type": "markdown",
125
+ "metadata": {
126
+ "id": "Hbc9MyUx7qmQ"
127
+ },
128
+ "source": [
129
+ "Now that we have our data, we can begin with the EDA.<br>**But first**, we need to transform the 'Rating' column to binary labels"
130
+ ]
131
+ },
132
+ {
133
+ "cell_type": "code",
134
+ "execution_count": null,
135
+ "metadata": {
136
+ "colab": {
137
+ "base_uri": "https://localhost:8080/",
138
+ "height": 272
139
+ },
140
+ "id": "cojecQBE7qmQ",
141
+ "outputId": "4c8e3d96-650c-4ab5-f9a0-e11a806f0581"
142
+ },
143
+ "outputs": [],
144
+ "source": [
145
+ "data['Rating'].value_counts() #frequency of each unique value in the Rating colum"
146
+ ]
147
+ },
148
+ {
149
+ "cell_type": "code",
150
+ "execution_count": null,
151
+ "metadata": {
152
+ "id": "BCpH9eNS7qmQ"
153
+ },
154
+ "outputs": [],
155
+ "source": [
156
+ "# rating 4, 5 => Positive; 1, 2, 3 => Negative\n",
157
+ "def ratings(rating):\n",
158
+ " if rating>3 and rating<=5:\n",
159
+ " return \"Positive\"\n",
160
+ " if rating>0 and rating<=3:\n",
161
+ " return \"Negative\""
162
+ ]
163
+ },
164
+ {
165
+ "cell_type": "code",
166
+ "execution_count": null,
167
+ "metadata": {
168
+ "colab": {
169
+ "base_uri": "https://localhost:8080/",
170
+ "height": 576
171
+ },
172
+ "id": "qYYYGO6m7qmQ",
173
+ "outputId": "6c952ddf-74bf-40f0-b656-ea483acaf373"
174
+ },
175
+ "outputs": [],
176
+ "source": [
177
+ "data['Rating'] = data['Rating'].apply(ratings)# apply() method applies a function (ratings) to each element in the Rating column.\n",
178
+ "plt.pie(data['Rating'].value_counts(), labels=data['Rating'].unique().tolist(), autopct='%1.1f%%')\n",
179
+ "plt.show()"
180
+ ]
181
+ },
182
+ {
183
+ "cell_type": "markdown",
184
+ "metadata": {
185
+ "id": "jjKQA2B87qmQ"
186
+ },
187
+ "source": [
188
+ "## Exploratory Data Analysis\n",
189
+ "\n",
190
+ "### Counts and Lenght:\n",
191
+ "Start by checking how long the reviews are\n",
192
+ "* Character count\n",
193
+ "* Word count\n",
194
+ "* Mean word length\n",
195
+ "* Mean sentence length"
196
+ ]
197
+ },
198
+ {
199
+ "cell_type": "code",
200
+ "execution_count": null,
201
+ "metadata": {
202
+ "colab": {
203
+ "base_uri": "https://localhost:8080/"
204
+ },
205
+ "id": "v5PdqZqS7qmQ",
206
+ "outputId": "dcb71d91-3ac1-4725-a74d-269e702ac256"
207
+ },
208
+ "outputs": [],
209
+ "source": [
210
+ "lenght = len(data['Review'][0])#irst element (row) of the Review column in the DataFrame.\n",
211
+ "print(f'Length of a sample review: {lenght}')"
212
+ ]
213
+ },
214
+ {
215
+ "cell_type": "markdown",
216
+ "metadata": {
217
+ "id": "ea-EoGmhzOLD"
218
+ },
219
+ "source": [
220
+ "nice hotel expensive parking got good deal stayed sat night because attending event hotel clean comfortable would stay again bargain price parking good central location\" , 593 characters"
221
+ ]
222
+ },
223
+ {
224
+ "cell_type": "code",
225
+ "execution_count": null,
226
+ "metadata": {
227
+ "_kg_hide-output": true,
228
+ "colab": {
229
+ "base_uri": "https://localhost:8080/",
230
+ "height": 363
231
+ },
232
+ "id": "91biDBhs7qmR",
233
+ "outputId": "58786188-d992-4f32-a35a-7fcb5b875ee3"
234
+ },
235
+ "outputs": [],
236
+ "source": [
237
+ "data['Length'] = data['Review'].str.len()\n",
238
+ "data.head(10)"
239
+ ]
240
+ },
241
+ {
242
+ "cell_type": "markdown",
243
+ "metadata": {
244
+ "id": "4hIpfCYn7qmR"
245
+ },
246
+ "source": [
247
+ "#### **Word Count**: Number of words in a review"
248
+ ]
249
+ },
250
+ {
251
+ "cell_type": "code",
252
+ "execution_count": null,
253
+ "metadata": {
254
+ "colab": {
255
+ "base_uri": "https://localhost:8080/"
256
+ },
257
+ "id": "ux3EvuGI7qmR",
258
+ "outputId": "ccbf3132-694e-484d-9338-58e9fc7f4821"
259
+ },
260
+ "outputs": [],
261
+ "source": [
262
+ "word_count = data['Review'][0].split()\n",
263
+ "print(f'Word count in a sample review: {len(word_count)}')"
264
+ ]
265
+ },
266
+ {
267
+ "cell_type": "code",
268
+ "execution_count": null,
269
+ "metadata": {
270
+ "id": "HxubAb-p7qmR"
271
+ },
272
+ "outputs": [],
273
+ "source": [
274
+ "def word_count(review):\n",
275
+ " review_list = review.split()\n",
276
+ " return len(review_list)"
277
+ ]
278
+ },
279
+ {
280
+ "cell_type": "code",
281
+ "execution_count": null,
282
+ "metadata": {
283
+ "_kg_hide-output": true,
284
+ "colab": {
285
+ "base_uri": "https://localhost:8080/",
286
+ "height": 363
287
+ },
288
+ "id": "sSR54vCk7qmR",
289
+ "outputId": "22ddfaf2-4a60-4b38-e350-2f33bc2a1cb4"
290
+ },
291
+ "outputs": [],
292
+ "source": [
293
+ "data['Word_count'] = data['Review'].apply(word_count)\n",
294
+ "data.head(10)"
295
+ ]
296
+ },
297
+ {
298
+ "cell_type": "markdown",
299
+ "metadata": {
300
+ "id": "S_NCkk9k7qmR"
301
+ },
302
+ "source": [
303
+ "#### **Mean word length**: Average length of words"
304
+ ]
305
+ },
306
+ {
307
+ "cell_type": "code",
308
+ "execution_count": null,
309
+ "metadata": {
310
+ "_kg_hide-output": true,
311
+ "colab": {
312
+ "base_uri": "https://localhost:8080/",
313
+ "height": 380
314
+ },
315
+ "id": "FuRGrjON7qmR",
316
+ "outputId": "67adf4aa-a4da-4139-da44-4078ed221b91"
317
+ },
318
+ "outputs": [],
319
+ "source": [
320
+ "data['mean_word_length'] = data['Review'].map(lambda rev: np.mean([len(word) for word in rev.split()]))\n",
321
+ "#average length of words in each review\n",
322
+ "data.head(10)"
323
+ ]
324
+ },
325
+ {
326
+ "cell_type": "markdown",
327
+ "metadata": {
328
+ "id": "_SfDq14lz8vS"
329
+ },
330
+ "source": [
331
+ "Mean Word Length=\n",
332
+ "Word Count/\n",
333
+ "Length of the Review\n",
334
+ "​\n",
335
+ "\n",
336
+ "For example, for the first review:\n",
337
+ "\n",
338
+ "Length of the Review: 593\n",
339
+ "Word Count: 87\n",
340
+ "Mean Word Length\n",
341
+ "=\n",
342
+ "593/\n",
343
+ "87\n",
344
+ "≈\n",
345
+ "5.804598\n",
346
+ "Mean Word Length=\n",
347
+ "87\n",
348
+ "593\n",
349
+ "​\n",
350
+ " ≈5.804598"
351
+ ]
352
+ },
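The relationship above is easy to sanity-check directly; a minimal sketch on a made-up two-word string (the numbers are illustrative, not from the dataset):

```python
import numpy as np

review = "nice hotel"  # hypothetical toy review
words = review.split()

mean_word_length = np.mean([len(w) for w in words])  # (4 + 5) / 2 = 4.5
naive_ratio = len(review) / len(words)               # 10 / 2 = 5.0, inflated by the space

print(mean_word_length, naive_ratio)
```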
353
+ {
354
+ "cell_type": "markdown",
355
+ "metadata": {
356
+ "id": "sMDMKTT07qmS"
357
+ },
358
+ "source": [
359
+ "#### **Mean sentence length**: Average length of the sentences in the review"
360
+ ]
361
+ },
362
+ {
363
+ "cell_type": "code",
364
+ "execution_count": null,
365
+ "metadata": {
366
+ "colab": {
367
+ "base_uri": "https://localhost:8080/"
368
+ },
369
+ "id": "StG9kd-57qmS",
370
+ "outputId": "2a693ea3-4b11-49c6-b6f7-a49d6d914c14"
371
+ },
372
+ "outputs": [],
373
+ "source": [
374
+ "import nltk\n",
375
+ "\n",
376
+ "nltk.download('punkt_tab')\n",
377
+ "\n",
378
+ "np.mean([len(sent) for sent in tokenize.sent_tokenize(data['Review'][0])])"
379
+ ]
380
+ },
381
+ {
382
+ "cell_type": "markdown",
383
+ "metadata": {
384
+ "id": "-30osJkyud2B"
385
+ },
386
+ "source": [
387
+ "tokenize.sent_tokenize(data['Review'][0]): Splits the first review (data['Review'][0]) into individual sentences.\n",
388
+ "len(sent): Calculates the number of characters in each sentence.\n",
389
+ "[len(sent) for sent in ...]: Creates a list of sentence lengths for the review.\n",
390
+ "np.mean(...): Calculates the mean (average) of the sentence lengths."
391
+ ]
392
+ },
393
+ {
394
+ "cell_type": "code",
395
+ "execution_count": null,
396
+ "metadata": {
397
+ "colab": {
398
+ "base_uri": "https://localhost:8080/",
399
+ "height": 589
400
+ },
401
+ "id": "djChgL8w7qmS",
402
+ "outputId": "838dce03-c1b8-42a9-d3d8-9b8a83731b50"
403
+ },
404
+ "outputs": [],
405
+ "source": [
406
+ "data['mean_sent_length'] = data['Review'].map(lambda rev: np.mean([len(sent) for sent in tokenize.sent_tokenize(rev)]))\n",
407
+ "data.head(10)"
408
+ ]
409
+ },
410
+ {
411
+ "cell_type": "markdown",
412
+ "metadata": {
413
+ "id": "GZ0Ab4Eq3nLw"
414
+ },
415
+ "source": [
416
+ "Mean Sentence Length=\n",
417
+ "\n",
418
+ "Length of the Review/Number of Sentences\n",
419
+ "​\n",
420
+ " =\n",
421
+ "1\n",
422
+ "593\n",
423
+ "​\n",
424
+ " =591.0"
425
+ ]
426
+ },
427
+ {
428
+ "cell_type": "markdown",
429
+ "metadata": {
430
+ "id": "gu3Oa0ztvP-R"
431
+ },
432
+ "source": [
433
+ "Row 1:\n",
434
+ "Sentences: [\"I love this product.\", \"It works well.\"]\n",
435
+ "Lengths: [20, 14]\n",
436
+ "Mean: (20 + 14) / 2 = 17.0\n",
437
+ "Row 2:\n",
438
+ "Sentences: [\"Not worth the price.\", \"Too expensive and low quality.\"]\n",
439
+ "Lengths: [21, 29]\n",
440
+ "Mean: (21 + 29) / 2 = 25.0\n",
441
+ "The mean_sent_length column will contain these averages for each review."
442
+ ]
443
+ },
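The toy numbers in that cell can be reproduced directly; a small sketch, assuming the punkt tokenizer data is already downloaded (the two reviews are the hypothetical ones above):

```python
import numpy as np
from nltk import tokenize

toy_reviews = [
    "I love this product. It works well.",                  # sentence lengths 20 and 14
    "Not worth the price. Too expensive and low quality.",  # sentence lengths 20 and 30
]

for rev in toy_reviews:
    lengths = [len(s) for s in tokenize.sent_tokenize(rev)]
    print(lengths, np.mean(lengths))  # [20, 14] 17.0 and [20, 30] 25.0
```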
444
+ {
445
+ "cell_type": "code",
446
+ "execution_count": null,
447
+ "metadata": {
448
+ "id": "tkn7Rifa7qmS"
449
+ },
450
+ "outputs": [],
451
+ "source": [
452
+ "def visualize(col):\n",
453
+ "\n",
454
+ " print()\n",
455
+ " plt.subplot(1,2,1)\n",
456
+ " sns.boxplot(y=data[col], x=data['Rating']) # Changed hue to x\n",
457
+ " plt.ylabel(col, labelpad=12.5)\n",
458
+ "\n",
459
+ " plt.subplot(1,2,2)\n",
460
+ " sns.kdeplot(x=data[col], hue=data['Rating']) # Changed data[col] to x=data[col]\n",
461
+ " plt.legend(data['Rating'].unique())\n",
462
+ " plt.xlabel('')\n",
463
+ " plt.ylabel('')\n",
464
+ "\n",
465
+ "plt.show() # Moved plt.show() outside the loop\n"
466
+ ]
467
+ },
468
+ {
469
+ "cell_type": "code",
470
+ "execution_count": null,
471
+ "metadata": {
472
+ "colab": {
473
+ "base_uri": "https://localhost:8080/",
474
+ "height": 406
475
+ },
476
+ "id": "ngsxYq7B7qmS",
477
+ "outputId": "3f2df19c-4c34-4632-e053-9f32cef24ce0"
478
+ },
479
+ "outputs": [],
480
+ "source": [
481
+ "features = data.columns.tolist()[2:]\n",
482
+ "for feature in features:\n",
483
+ " visualize(feature)"
484
+ ]
485
+ },
486
+ {
487
+ "cell_type": "markdown",
488
+ "metadata": {
489
+ "id": "d1We3wN-7qmS"
490
+ },
491
+ "source": [
492
+ "## Term Frequency Analysis\n",
493
+ "Examining the most frequently occuring words is one of the most popular systems of Text analytics. For example, in a sentiment analysis problem, a positive text is bound to have words like 'good', 'great', 'nice', etc. more in number than other words that imply otherwise.\n",
494
+ "\n",
495
+ "*Note*: Term Frequencies are more than counts and lenghts, so the first requirement is to preprocess the text"
496
+ ]
497
+ },
498
+ {
499
+ "cell_type": "code",
500
+ "execution_count": null,
501
+ "metadata": {
502
+ "colab": {
503
+ "base_uri": "https://localhost:8080/",
504
+ "height": 206
505
+ },
506
+ "id": "X5hNJsve7qmS",
507
+ "outputId": "ce658f9e-c17c-4209-a422-6a401d3779c4"
508
+ },
509
+ "outputs": [],
510
+ "source": [
511
+ "df = data.drop(features, axis=1)\n",
512
+ "df.head()"
513
+ ]
514
+ },
515
+ {
516
+ "cell_type": "code",
517
+ "execution_count": null,
518
+ "metadata": {
519
+ "colab": {
520
+ "base_uri": "https://localhost:8080/"
521
+ },
522
+ "id": "mUt47IDG7qmS",
523
+ "outputId": "914d80d2-c511-4a3a-eba8-780af3985eec"
524
+ },
525
+ "outputs": [],
526
+ "source": [
527
+ "df.info()"
528
+ ]
529
+ },
530
+ {
531
+ "cell_type": "markdown",
532
+ "metadata": {
533
+ "id": "FJ4S44-N7qmS"
534
+ },
535
+ "source": [
536
+ "There is no missing data, therefore, we can move to the next stage. For Term frequency analysis, it is essential that the text data be preprocessed.\n",
537
+ "* Lowercase\n",
538
+ "* Remove punctutations\n",
539
+ "* Stopword removal"
540
+ ]
541
+ },
542
+ {
543
+ "cell_type": "code",
544
+ "execution_count": null,
545
+ "metadata": {
546
+ "id": "9AevOzM77qmS"
547
+ },
548
+ "outputs": [],
549
+ "source": [
550
+ "def clean(review):\n",
551
+ "\n",
552
+ " review = review.lower()\n",
553
+ " review = re.sub('[^a-z A-Z 0-9-]+', '', review)\n",
554
+ " review = \" \".join([word for word in review.split() if word not in stopwords.words('english')])\n",
555
+ "\n",
556
+ " return review"
557
+ ]
558
+ },
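As a quick check of what `clean` actually does, a sketch on a hypothetical input (the expected output under these rules is shown as a comment):

```python
sample = "The hotel was AMAZING, but the Wi-Fi didn't work!"  # hypothetical review
print(clean(sample))
# hotel amazing wi-fi didnt work  <- lowercased, punctuation stripped, stopwords removed
```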
559
+ {
560
+ "cell_type": "code",
561
+ "execution_count": null,
562
+ "metadata": {
563
+ "_kg_hide-output": true,
564
+ "colab": {
565
+ "base_uri": "https://localhost:8080/",
566
+ "height": 398
567
+ },
568
+ "id": "1GyDUkdZ7qmT",
569
+ "outputId": "6cf88ffa-e236-4e08-c079-32313385e4de"
570
+ },
571
+ "outputs": [],
572
+ "source": [
573
+ " import nltk\n",
574
+ " nltk.download('stopwords')\n",
575
+ "df['Review'] = df['Review'].apply(clean)\n",
576
+ "df.head(10)\n",
577
+ "# Convert Text to Lowercase\n",
578
+ "# Convert Text to Lowercase\n",
579
+ "#Remove Stopwords\n",
580
+ "#tokenization"
581
+ ]
582
+ },
583
+ {
584
+ "cell_type": "code",
585
+ "execution_count": null,
586
+ "metadata": {
587
+ "colab": {
588
+ "base_uri": "https://localhost:8080/",
589
+ "height": 122
590
+ },
591
+ "id": "pwSmwu747qmT",
592
+ "outputId": "f1d16f69-13ff-4d5c-b443-034e12d1ba6d"
593
+ },
594
+ "outputs": [],
595
+ "source": [
596
+ "df['Review'][0]"
597
+ ]
598
+ },
599
+ {
600
+ "cell_type": "code",
601
+ "execution_count": null,
602
+ "metadata": {
603
+ "id": "84MJDfv37qmT"
604
+ },
605
+ "outputs": [],
606
+ "source": [
607
+ "def corpus(text):\n",
608
+ " text_list = text.split()\n",
609
+ " return text_list"
610
+ ]
611
+ },
612
+ {
613
+ "cell_type": "code",
614
+ "execution_count": null,
615
+ "metadata": {
616
+ "_kg_hide-output": true,
617
+ "colab": {
618
+ "base_uri": "https://localhost:8080/",
619
+ "height": 502
620
+ },
621
+ "id": "w4PhDhS67qmT",
622
+ "outputId": "2be2b5de-f793-425d-cb1b-faf5c293b652"
623
+ },
624
+ "outputs": [],
625
+ "source": [
626
+ "df['Review_lists'] = df['Review'].apply(corpus)\n",
627
+ "df.head(10)"
628
+ ]
629
+ },
630
+ {
631
+ "cell_type": "code",
632
+ "execution_count": null,
633
+ "metadata": {
634
+ "colab": {
635
+ "base_uri": "https://localhost:8080/"
636
+ },
637
+ "id": "LBFaC6gO7qmT",
638
+ "outputId": "11e0cb59-379b-4cf8-d358-9c574575cc98"
639
+ },
640
+ "outputs": [],
641
+ "source": [
642
+ "corpus = []\n",
643
+ "for i in trange(df.shape[0], ncols=150, nrows=10, colour='green', smoothing=0.8):\n",
644
+ " corpus += df['Review_lists'][i]\n",
645
+ "len(corpus) #append all elements from the Review_lists column into corpus"
646
+ ]
647
+ },
648
+ {
649
+ "cell_type": "code",
650
+ "execution_count": null,
651
+ "metadata": {
652
+ "colab": {
653
+ "base_uri": "https://localhost:8080/"
654
+ },
655
+ "id": "J8nzbqaH7qmT",
656
+ "outputId": "73f954fd-c12a-4fc7-fee5-52e3355f1212"
657
+ },
658
+ "outputs": [],
659
+ "source": [
660
+ "mostCommon = Counter(corpus).most_common(10)\n",
661
+ "mostCommon"
662
+ ]
663
+ },
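`FreqDist` was imported in the setup cell but never used; it is nltk's counterpart to `Counter` and produces the same ranking, with a few NLP conveniences on top. A sketch of the equivalent call, assuming `corpus` is the token list built above:

```python
from nltk.probability import FreqDist

fdist = FreqDist(corpus)      # same token list as above
print(fdist.most_common(10))  # matches Counter(corpus).most_common(10)
fdist.plot(10)                # optional built-in rank/frequency plot
```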
664
+ {
665
+ "cell_type": "code",
666
+ "execution_count": null,
667
+ "metadata": {
668
+ "id": "K6ODVB4-7qmT"
669
+ },
670
+ "outputs": [],
671
+ "source": [
672
+ "words = []\n",
673
+ "freq = []\n",
674
+ "for word, count in mostCommon:\n",
675
+ " words.append(word)\n",
676
+ " freq.append(count)"
677
+ ]
678
+ },
679
+ {
680
+ "cell_type": "code",
681
+ "execution_count": null,
682
+ "metadata": {
683
+ "colab": {
684
+ "base_uri": "https://localhost:8080/",
685
+ "height": 340
686
+ },
687
+ "id": "Ir6FuvfM7qmU",
688
+ "outputId": "b8590f9b-ac31-4b67-e27d-57f7a98f8dff"
689
+ },
690
+ "outputs": [],
691
+ "source": [
692
+ "sns.barplot(x=freq, y=words)\n",
693
+ "plt.title('Top 10 Most Frequently Occuring Words')\n",
694
+ "plt.show()"
695
+ ]
696
+ },
697
+ {
698
+ "cell_type": "markdown",
699
+ "metadata": {
700
+ "id": "rYeCoTCB7qmU"
701
+ },
702
+ "source": [
703
+ "## Most Frequently occuring N_grams\n",
704
+ "\n",
705
+ "**What is an N-gram?** <br>\n",
706
+ "An n-gram is sequence of n words in a text. Most words by themselves may not present the entire context. Typically adverbs such as 'most' or 'very' are used to modify verbs and adjectives. Therefore, n-grams help analyse phrases and not just words which can lead to better insights.\n",
707
+ "<br>\n",
708
+ "> A **Bi-gram** means two words in a sequence. 'Very good' or 'Too great'<br>\n",
709
+ "> A **Tri-gram** means three words in a sequence. 'How was your day' would be broken down to 'How was your' and 'was your day'.<br>\n",
710
+ "\n",
711
+ "For separating text into n-grams, we will use `CountVectorizer` from Sklearn"
712
+ ]
713
+ },
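To make the `ngram_range` mechanics concrete before running them on the full corpus, a minimal sketch on two hypothetical sentences:

```python
from sklearn.feature_extraction.text import CountVectorizer

toy = ["very good hotel", "very good price"]
cv_demo = CountVectorizer(ngram_range=(2, 2))  # bigrams only
counts = cv_demo.fit_transform(toy)

print(cv_demo.get_feature_names_out())  # ['good hotel' 'good price' 'very good']
print(counts.toarray().sum(axis=0))     # [1 1 2] -> 'very good' appears twice
```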
714
+ {
715
+ "cell_type": "code",
716
+ "execution_count": null,
717
+ "metadata": {
718
+ "id": "z1A_JYd07qmU"
719
+ },
720
+ "outputs": [],
721
+ "source": [
722
+ "cv = CountVectorizer(ngram_range=(2,2))\n",
723
+ "bigrams = cv.fit_transform(df['Review'])"
724
+ ]
725
+ },
726
+ {
727
+ "cell_type": "code",
728
+ "execution_count": null,
729
+ "metadata": {
730
+ "id": "qds7d7lx7qmd"
731
+ },
732
+ "outputs": [],
733
+ "source": [
734
+ "count_values = bigrams.toarray().sum(axis=0)\n",
735
+ "ngram_freq = pd.DataFrame(sorted([(count_values[i], k) for k, i in cv.vocabulary_.items()], reverse = True))\n",
736
+ "ngram_freq.columns = [\"frequency\", \"ngram\"]"
737
+ ]
738
+ },
739
+ {
740
+ "cell_type": "code",
741
+ "execution_count": null,
742
+ "metadata": {
743
+ "colab": {
744
+ "base_uri": "https://localhost:8080/",
745
+ "height": 324
746
+ },
747
+ "id": "u4FTk_rI7qmd",
748
+ "outputId": "ace83676-57ba-43b6-a8e4-094cd98ffdc6"
749
+ },
750
+ "outputs": [],
751
+ "source": [
752
+ "sns.barplot(x=ngram_freq['frequency'][:10], y=ngram_freq['ngram'][:10])\n",
753
+ "plt.title('Top 10 Most Frequently Occuring Bigrams')\n",
754
+ "plt.show()"
755
+ ]
756
+ },
757
+ {
758
+ "cell_type": "code",
759
+ "execution_count": null,
760
+ "metadata": {
761
+ "colab": {
762
+ "base_uri": "https://localhost:8080/",
763
+ "height": 373
764
+ },
765
+ "id": "y1iBCNhi7qme",
766
+ "outputId": "6a4fe357-27d4-4e00-82e1-7e23fc62c279"
767
+ },
768
+ "outputs": [],
769
+ "source": [
770
+ "cv1 = CountVectorizer(ngram_range=(3,3))\n",
771
+ "trigrams = cv1.fit_transform(df['Review'])\n",
772
+ "count_values = trigrams.toarray().sum(axis=0)\n",
773
+ "ngram_freq = pd.DataFrame(sorted([(count_values[i], k) for k, i in cv1.vocabulary_.items()], reverse = True))\n",
774
+ "ngram_freq.columns = [\"frequency\", \"ngram\"]"
775
+ ]
776
+ },
777
+ {
778
+ "cell_type": "code",
779
+ "execution_count": null,
780
+ "metadata": {
781
+ "id": "XOcm6yFw7qme"
782
+ },
783
+ "outputs": [],
784
+ "source": [
785
+ "sns.barplot(x=ngram_freq['frequency'][:10], y=ngram_freq['ngram'][:10])\n",
786
+ "plt.title('Top 10 Most Frequently Occuring Trigrams')\n",
787
+ "plt.show()"
788
+ ]
789
+ },
790
+ {
791
+ "cell_type": "markdown",
792
+ "metadata": {
793
+ "id": "y-bWL2ce7qme"
794
+ },
795
+ "source": [
796
+ "<div class=\"alert alert-info\" role=\"alert\">\n",
797
+ " <h2>But what about Word Clouds?</h2>\n",
798
+ "\n",
799
+ "<p>\n",
800
+ " While word clouds are very appealing, they really don't provide a lot of information. A word or two are very obviously visible but other than that, there is not a lot to examine. <b>A simple bar plot may not be as attractive as a word cloud but it is surely more informative</b> - which is our ultimate goal. A word cloud may serve better as a cover to present your solution (which is why its right on top), but it can hardly be the solution. Of course, this is my personal opinion and word clouds should be used if they're absolutely needed. <br><br>\n",
801
+ " What do you think? Let me know in the comments!</p>\n",
802
+ "</div>"
803
+ ]
804
+ }
805
+ ],
806
+ "metadata": {
807
+ "colab": {
808
+ "provenance": []
809
+ },
810
+ "kaggle": {
811
+ "accelerator": "none",
812
+ "dataSources": [
813
+ {
814
+ "datasetId": 897156,
815
+ "sourceId": 1526618,
816
+ "sourceType": "datasetVersion"
817
+ }
818
+ ],
819
+ "dockerImageVersionId": 30260,
820
+ "isGpuEnabled": false,
821
+ "isInternetEnabled": true,
822
+ "language": "python",
823
+ "sourceType": "notebook"
824
+ },
825
+ "kernelspec": {
826
+ "display_name": "Python 3 (ipykernel)",
827
+ "language": "python",
828
+ "name": "python3"
829
+ },
830
+ "language_info": {
831
+ "codemirror_mode": {
832
+ "name": "ipython",
833
+ "version": 3
834
+ },
835
+ "file_extension": ".py",
836
+ "mimetype": "text/x-python",
837
+ "name": "python",
838
+ "nbconvert_exporter": "python",
839
+ "pygments_lexer": "ipython3",
840
+ "version": "3.12.4"
841
+ }
842
+ },
843
+ "nbformat": 4,
844
+ "nbformat_minor": 4
845
+ }