noshot 0.4.1__py3-none-any.whl → 1.0.0__py3-none-any.whl

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (29)
  1. noshot/data/ML TS XAI/TS/10. Seasonal ARIMA Forecasting.ipynb +32 -714
  2. noshot/data/ML TS XAI/TS/11. Multivariate ARIMA Forecasting.ipynb +29 -1071
  3. noshot/data/ML TS XAI/TS/6. ACF PACF.ipynb +7 -105
  4. noshot/data/ML TS XAI/TS/7. Differencing.ipynb +16 -152
  5. noshot/data/ML TS XAI/TS/8. ARMA Forecasting.ipynb +26 -575
  6. noshot/data/ML TS XAI/TS/9. ARIMA Forecasting.ipynb +23 -382
  7. noshot/data/ML TS XAI/XAI/XAI 1/EDA2_chipsdatset.ipynb +633 -0
  8. noshot/data/ML TS XAI/XAI/XAI 1/EDA_IRISH_8thjan.ipynb +326 -0
  9. noshot/data/ML TS XAI/XAI/XAI 1/XAI_EX1 MODEL BIAS (FINAL).ipynb +487 -0
  10. noshot/data/ML TS XAI/XAI/XAI 1/complete_guide_to_eda_on_text_data.ipynb +845 -0
  11. noshot/data/ML TS XAI/XAI/XAI 1/deepchecksframeworks.ipynb +100 -0
  12. noshot/data/ML TS XAI/XAI/XAI 1/deepexplainers (mnist).ipynb +90 -0
  13. noshot/data/ML TS XAI/XAI/XAI 1/guidedbackpropagation.ipynb +203 -0
  14. noshot/data/ML TS XAI/XAI/XAI 1/updated_image_EDA1_with_LRP.ipynb +3998 -0
  15. noshot/data/ML TS XAI/XAI/XAI 1/zebrastripes.ipynb +271 -0
  16. noshot/data/ML TS XAI/XAI/XAI 2/EXP_5.ipynb +1545 -0
  17. noshot/data/ML TS XAI/XAI/XAI 2/Exp-3 (EDA-loan).ipynb +221 -0
  18. noshot/data/ML TS XAI/XAI/XAI 2/Exp-3 (EDA-movie).ipynb +229 -0
  19. noshot/data/ML TS XAI/XAI/XAI 2/Exp-4(Flower dataset).ipynb +237 -0
  20. noshot/data/ML TS XAI/XAI/XAI 2/Exp-4.ipynb +241 -0
  21. noshot/data/ML TS XAI/XAI/XAI 2/Exp_2.ipynb +352 -0
  22. noshot/data/ML TS XAI/XAI/XAI 2/Exp_7.ipynb +110 -0
  23. noshot/data/ML TS XAI/XAI/XAI 2/FeatureImportance_SensitivityAnalysis.ipynb +708 -0
  24. {noshot-0.4.1.dist-info → noshot-1.0.0.dist-info}/METADATA +1 -1
  25. noshot-1.0.0.dist-info/RECORD +32 -0
  26. noshot-0.4.1.dist-info/RECORD +0 -15
  27. {noshot-0.4.1.dist-info → noshot-1.0.0.dist-info}/WHEEL +0 -0
  28. {noshot-0.4.1.dist-info → noshot-1.0.0.dist-info}/licenses/LICENSE.txt +0 -0
  29. {noshot-0.4.1.dist-info → noshot-1.0.0.dist-info}/top_level.txt +0 -0
@@ -0,0 +1,845 @@
1
+ {
2
+ "cells": [
3
+ {
4
+ "cell_type": "markdown",
5
+ "metadata": {
6
+ "id": "xXn0bq0L7qmN"
7
+ },
8
+ "source": [
9
+ "## Setup"
10
+ ]
11
+ },
12
+ {
13
+ "cell_type": "code",
14
+ "execution_count": null,
15
+ "metadata": {
16
+ "id": "JB9sNbP87qmO"
17
+ },
18
+ "outputs": [],
19
+ "source": [
20
+ "import numpy as np\n",
21
+ "import pandas as pd\n",
22
+ "import matplotlib.pyplot as plt\n",
23
+ "import seaborn as sns\n",
24
+ "import string\n",
25
+ "import re\n",
26
+ "import nltk\n",
27
+ "\n",
28
+ "from tqdm import trange\n",
29
+ "from nltk import tokenize\n",
30
+ "from nltk.corpus import stopwords\n",
31
+ "from nltk.stem import WordNetLemmatizer\n",
32
+ "from nltk.probability import FreqDist\n",
33
+ "from collections import Counter\n",
34
+ "from sklearn.feature_extraction.text import CountVectorizer"
35
+ ]
36
+ },
37
+ {
38
+ "cell_type": "markdown",
39
+ "metadata": {
40
+ "id": "5UfIxL3eXIaq"
41
+ },
42
+ "source": [
43
+ "\n",
44
+ "numpy: Useful for numerical operations and array manipulations.\n",
45
+ "pandas: Ideal for data manipulation and analysis using DataFrames.\n",
46
+ "Libraries for Visualization:\n",
47
+ "matplotlib.pyplot: Provides plotting capabilities for creating static, interactive, and animated visualizations.\n",
48
+ "seaborn: Enhances matplotlib by providing a high-level interface for drawing attractive statistical graphics.\n",
49
+ "Libraries for Text Processing:\n",
50
+ "string: Provides constants and classes for string operations.\n",
51
+ "re: Supports regular expression operations for pattern matching and text processing.\n",
52
+ "nltk (Natural Language Toolkit): A suite of libraries for natural language processing. Specific modules used here include:\n",
53
+ "nltk.tokenize: For splitting text into words or sentences.\n",
54
+ "nltk.corpus.stopwords: Provides a list of common stopwords in various languages.\n",
55
+ "nltk.stem.WordNetLemmatizer: For reducing words to their base or root form.\n",
56
+ "nltk.probability.FreqDist: Computes the frequency distribution of words or events.\n",
57
+ "Utility Libraries:\n",
58
+ "tqdm.trange: Adds a progress bar to loops, providing feedback on execution progress.\n",
59
+ "Data Structures and Algorithms:\n",
60
+ "collections.Counter: Counts occurrences of elements in an iterable, useful for frequency analysis.\n",
61
+ "Feature Extraction:\n",
62
+ "sklearn.feature_extraction.text.CountVectorizer: Converts a collection of text documents to a matrix of token counts."
63
+ ]
64
+ },
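Several of the nltk modules above depend on corpora that must be downloaded once per environment. A minimal consolidated sketch of those one-time downloads (the resource names are inferred from the calls made later in this notebook):

```python
import nltk

# one-time downloads used later in the notebook
nltk.download('punkt_tab', quiet=True)  # sentence/word tokenizers (older NLTK versions use 'punkt')
nltk.download('stopwords', quiet=True)  # stopword lists for stopword removal
nltk.download('omw-1.4', quiet=True)    # multilingual wordnet data used by the lemmatizer
nltk.download('wordnet', quiet=True)    # base corpus for WordNetLemmatizer
```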
72
+ {
73
+ "cell_type": "code",
74
+ "execution_count": null,
75
+ "metadata": {
76
+ "id": "gbCBi_3a7qmP"
77
+ },
78
+ "outputs": [],
79
+ "source": [
80
+ "import warnings\n",
81
+ "warnings.filterwarnings('ignore') #Suppresses warning messages,\n",
82
+ "nltk.download('omw-1.4', quiet=True)\n",
83
+ "sns.set_style('darkgrid')\n",
84
+ "plt.rcParams['figure.figsize'] = (17,7) #Sets global parameters for Matplotlib plots, runtime configuration parameters\n",
85
+ "plt.rcParams['font.size'] = 18"
86
+ ]
87
+ },
95
+ {
96
+ "cell_type": "markdown",
97
+ "metadata": {
98
+ "id": "1OtCQoeg7qmP"
99
+ },
100
+ "source": [
101
+ "## Loading the Data"
102
+ ]
103
+ },
104
+ {
105
+ "cell_type": "code",
106
+ "execution_count": null,
107
+ "metadata": {
108
+ "_kg_hide-input": false,
109
+ "_kg_hide-output": true,
110
+ "colab": {
111
+ "base_uri": "https://localhost:8080/",
112
+ "height": 363
113
+ },
114
+ "id": "fQUgnEjX7qmQ",
115
+ "outputId": "3f3993d0-3f0b-4636-ad3a-d8222f9554e0"
116
+ },
117
+ "outputs": [],
118
+ "source": [
119
+ "data = pd.read_csv(\"tripadvisor_hotel_reviews.csv\")\n",
120
+ "data.head(10)"
121
+ ]
122
+ },
123
+ {
124
+ "cell_type": "markdown",
125
+ "metadata": {
126
+ "id": "Hbc9MyUx7qmQ"
127
+ },
128
+ "source": [
129
+ "Now that we have our data, we can begin with the EDA.<br>**But first**, we need to transform the 'Rating' column to binary labels"
130
+ ]
131
+ },
132
+ {
133
+ "cell_type": "code",
134
+ "execution_count": null,
135
+ "metadata": {
136
+ "colab": {
137
+ "base_uri": "https://localhost:8080/",
138
+ "height": 272
139
+ },
140
+ "id": "cojecQBE7qmQ",
141
+ "outputId": "4c8e3d96-650c-4ab5-f9a0-e11a806f0581"
142
+ },
143
+ "outputs": [],
144
+ "source": [
145
+ "data['Rating'].value_counts() #frequency of each unique value in the Rating colum"
146
+ ]
147
+ },
148
+ {
149
+ "cell_type": "code",
150
+ "execution_count": null,
151
+ "metadata": {
152
+ "id": "BCpH9eNS7qmQ"
153
+ },
154
+ "outputs": [],
155
+ "source": [
156
+ "# rating 4, 5 => Positive; 1, 2, 3 => Negative\n",
157
+ "def ratings(rating):\n",
158
+ " if rating>3 and rating<=5:\n",
159
+ " return \"Positive\"\n",
160
+ " if rating>0 and rating<=3:\n",
161
+ " return \"Negative\""
162
+ ]
163
+ },
164
+ {
165
+ "cell_type": "code",
166
+ "execution_count": null,
167
+ "metadata": {
168
+ "colab": {
169
+ "base_uri": "https://localhost:8080/",
170
+ "height": 576
171
+ },
172
+ "id": "qYYYGO6m7qmQ",
173
+ "outputId": "6c952ddf-74bf-40f0-b656-ea483acaf373"
174
+ },
175
+ "outputs": [],
176
+ "source": [
177
+ "data['Rating'] = data['Rating'].apply(ratings)# apply() method applies a function (ratings) to each element in the Rating column.\n",
178
+ "plt.pie(data['Rating'].value_counts(), labels=data['Rating'].unique().tolist(), autopct='%1.1f%%')\n",
179
+ "plt.show()"
180
+ ]
181
+ },
182
+ {
183
+ "cell_type": "markdown",
184
+ "metadata": {
185
+ "id": "jjKQA2B87qmQ"
186
+ },
187
+ "source": [
188
+ "## Exploratory Data Analysis\n",
189
+ "\n",
190
+ "### Counts and Lenght:\n",
191
+ "Start by checking how long the reviews are\n",
192
+ "* Character count\n",
193
+ "* Word count\n",
194
+ "* Mean word length\n",
195
+ "* Mean sentence length"
196
+ ]
197
+ },
198
+ {
199
+ "cell_type": "code",
200
+ "execution_count": null,
201
+ "metadata": {
202
+ "colab": {
203
+ "base_uri": "https://localhost:8080/"
204
+ },
205
+ "id": "v5PdqZqS7qmQ",
206
+ "outputId": "dcb71d91-3ac1-4725-a74d-269e702ac256"
207
+ },
208
+ "outputs": [],
209
+ "source": [
210
+ "lenght = len(data['Review'][0])#irst element (row) of the Review column in the DataFrame.\n",
211
+ "print(f'Length of a sample review: {lenght}')"
212
+ ]
213
+ },
214
+ {
215
+ "cell_type": "markdown",
216
+ "metadata": {
217
+ "id": "ea-EoGmhzOLD"
218
+ },
219
+ "source": [
220
+ "nice hotel expensive parking got good deal stayed sat night because attending event hotel clean comfortable would stay again bargain price parking good central location\" , 593 characters"
221
+ ]
222
+ },
223
+ {
224
+ "cell_type": "code",
225
+ "execution_count": null,
226
+ "metadata": {
227
+ "_kg_hide-output": true,
228
+ "colab": {
229
+ "base_uri": "https://localhost:8080/",
230
+ "height": 363
231
+ },
232
+ "id": "91biDBhs7qmR",
233
+ "outputId": "58786188-d992-4f32-a35a-7fcb5b875ee3"
234
+ },
235
+ "outputs": [],
236
+ "source": [
237
+ "data['Length'] = data['Review'].str.len()\n",
238
+ "data.head(10)"
239
+ ]
240
+ },
241
+ {
242
+ "cell_type": "markdown",
243
+ "metadata": {
244
+ "id": "4hIpfCYn7qmR"
245
+ },
246
+ "source": [
247
+ "#### **Word Count**: Number of words in a review"
248
+ ]
249
+ },
250
+ {
251
+ "cell_type": "code",
252
+ "execution_count": null,
253
+ "metadata": {
254
+ "colab": {
255
+ "base_uri": "https://localhost:8080/"
256
+ },
257
+ "id": "ux3EvuGI7qmR",
258
+ "outputId": "ccbf3132-694e-484d-9338-58e9fc7f4821"
259
+ },
260
+ "outputs": [],
261
+ "source": [
262
+ "word_count = data['Review'][0].split()\n",
263
+ "print(f'Word count in a sample review: {len(word_count)}')"
264
+ ]
265
+ },
266
+ {
267
+ "cell_type": "code",
268
+ "execution_count": null,
269
+ "metadata": {
270
+ "id": "HxubAb-p7qmR"
271
+ },
272
+ "outputs": [],
273
+ "source": [
274
+ "def word_count(review):\n",
275
+ " review_list = review.split()\n",
276
+ " return len(review_list)"
277
+ ]
278
+ },
279
+ {
280
+ "cell_type": "code",
281
+ "execution_count": null,
282
+ "metadata": {
283
+ "_kg_hide-output": true,
284
+ "colab": {
285
+ "base_uri": "https://localhost:8080/",
286
+ "height": 363
287
+ },
288
+ "id": "sSR54vCk7qmR",
289
+ "outputId": "22ddfaf2-4a60-4b38-e350-2f33bc2a1cb4"
290
+ },
291
+ "outputs": [],
292
+ "source": [
293
+ "data['Word_count'] = data['Review'].apply(word_count)\n",
294
+ "data.head(10)"
295
+ ]
296
+ },
297
+ {
298
+ "cell_type": "markdown",
299
+ "metadata": {
300
+ "id": "S_NCkk9k7qmR"
301
+ },
302
+ "source": [
303
+ "#### **Mean word length**: Average length of words"
304
+ ]
305
+ },
306
+ {
307
+ "cell_type": "code",
308
+ "execution_count": null,
309
+ "metadata": {
310
+ "_kg_hide-output": true,
311
+ "colab": {
312
+ "base_uri": "https://localhost:8080/",
313
+ "height": 380
314
+ },
315
+ "id": "FuRGrjON7qmR",
316
+ "outputId": "67adf4aa-a4da-4139-da44-4078ed221b91"
317
+ },
318
+ "outputs": [],
319
+ "source": [
320
+ "data['mean_word_length'] = data['Review'].map(lambda rev: np.mean([len(word) for word in rev.split()]))\n",
321
+ "#average length of words in each review\n",
322
+ "data.head(10)"
323
+ ]
324
+ },
325
+ {
326
+ "cell_type": "markdown",
327
+ "metadata": {
328
+ "id": "_SfDq14lz8vS"
329
+ },
330
+ "source": [
331
+ "Mean Word Length=\n",
332
+ "Word Count/\n",
333
+ "Length of the Review\n",
334
+ "​\n",
335
+ "\n",
336
+ "For example, for the first review:\n",
337
+ "\n",
338
+ "Length of the Review: 593\n",
339
+ "Word Count: 87\n",
340
+ "Mean Word Length\n",
341
+ "=\n",
342
+ "593/\n",
343
+ "87\n",
344
+ "≈\n",
345
+ "5.804598\n",
346
+ "Mean Word Length=\n",
347
+ "87\n",
348
+ "593\n",
349
+ "​\n",
350
+ " ≈5.804598"
351
+ ]
352
+ },
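The relationship above is easy to sanity-check directly; a minimal sketch on a made-up two-word string (the numbers are illustrative, not from the dataset):

```python
import numpy as np

review = "nice hotel"  # hypothetical toy review
words = review.split()

mean_word_length = np.mean([len(w) for w in words])  # (4 + 5) / 2 = 4.5
naive_ratio = len(review) / len(words)               # 10 / 2 = 5.0, inflated by the space

print(mean_word_length, naive_ratio)
```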
353
+ {
354
+ "cell_type": "markdown",
355
+ "metadata": {
356
+ "id": "sMDMKTT07qmS"
357
+ },
358
+ "source": [
359
+ "#### **Mean sentence length**: Average length of the sentences in the review"
360
+ ]
361
+ },
362
+ {
363
+ "cell_type": "code",
364
+ "execution_count": null,
365
+ "metadata": {
366
+ "colab": {
367
+ "base_uri": "https://localhost:8080/"
368
+ },
369
+ "id": "StG9kd-57qmS",
370
+ "outputId": "2a693ea3-4b11-49c6-b6f7-a49d6d914c14"
371
+ },
372
+ "outputs": [],
373
+ "source": [
374
+ "import nltk\n",
375
+ "\n",
376
+ "nltk.download('punkt_tab')\n",
377
+ "\n",
378
+ "np.mean([len(sent) for sent in tokenize.sent_tokenize(data['Review'][0])])"
379
+ ]
380
+ },
381
+ {
382
+ "cell_type": "markdown",
383
+ "metadata": {
384
+ "id": "-30osJkyud2B"
385
+ },
386
+ "source": [
387
+ "tokenize.sent_tokenize(data['Review'][0]): Splits the first review (data['Review'][0]) into individual sentences.\n",
388
+ "len(sent): Calculates the number of characters in each sentence.\n",
389
+ "[len(sent) for sent in ...]: Creates a list of sentence lengths for the review.\n",
390
+ "np.mean(...): Calculates the mean (average) of the sentence lengths."
391
+ ]
392
+ },
393
+ {
394
+ "cell_type": "code",
395
+ "execution_count": null,
396
+ "metadata": {
397
+ "colab": {
398
+ "base_uri": "https://localhost:8080/",
399
+ "height": 589
400
+ },
401
+ "id": "djChgL8w7qmS",
402
+ "outputId": "838dce03-c1b8-42a9-d3d8-9b8a83731b50"
403
+ },
404
+ "outputs": [],
405
+ "source": [
406
+ "data['mean_sent_length'] = data['Review'].map(lambda rev: np.mean([len(sent) for sent in tokenize.sent_tokenize(rev)]))\n",
407
+ "data.head(10)"
408
+ ]
409
+ },
410
+ {
411
+ "cell_type": "markdown",
412
+ "metadata": {
413
+ "id": "GZ0Ab4Eq3nLw"
414
+ },
415
+ "source": [
416
+ "Mean Sentence Length=\n",
417
+ "\n",
418
+ "Length of the Review/Number of Sentences\n",
419
+ "​\n",
420
+ " =\n",
421
+ "1\n",
422
+ "593\n",
423
+ "​\n",
424
+ " =591.0"
425
+ ]
426
+ },
427
+ {
428
+ "cell_type": "markdown",
429
+ "metadata": {
430
+ "id": "gu3Oa0ztvP-R"
431
+ },
432
+ "source": [
433
+ "Row 1:\n",
434
+ "Sentences: [\"I love this product.\", \"It works well.\"]\n",
435
+ "Lengths: [20, 14]\n",
436
+ "Mean: (20 + 14) / 2 = 17.0\n",
437
+ "Row 2:\n",
438
+ "Sentences: [\"Not worth the price.\", \"Too expensive and low quality.\"]\n",
439
+ "Lengths: [21, 29]\n",
440
+ "Mean: (21 + 29) / 2 = 25.0\n",
441
+ "The mean_sent_length column will contain these averages for each review."
442
+ ]
443
+ },
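The toy numbers in that cell can be reproduced directly; a small sketch, assuming the punkt tokenizer data is already downloaded (the two reviews are the hypothetical ones above):

```python
import numpy as np
from nltk import tokenize

toy_reviews = [
    "I love this product. It works well.",                  # sentence lengths 20 and 14
    "Not worth the price. Too expensive and low quality.",  # sentence lengths 20 and 30
]

for rev in toy_reviews:
    lengths = [len(s) for s in tokenize.sent_tokenize(rev)]
    print(lengths, np.mean(lengths))  # [20, 14] 17.0 and [20, 30] 25.0
```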
444
+ {
445
+ "cell_type": "code",
446
+ "execution_count": null,
447
+ "metadata": {
448
+ "id": "tkn7Rifa7qmS"
449
+ },
450
+ "outputs": [],
451
+ "source": [
452
+ "def visualize(col):\n",
453
+ "\n",
454
+ " print()\n",
455
+ " plt.subplot(1,2,1)\n",
456
+ " sns.boxplot(y=data[col], x=data['Rating']) # Changed hue to x\n",
457
+ " plt.ylabel(col, labelpad=12.5)\n",
458
+ "\n",
459
+ " plt.subplot(1,2,2)\n",
460
+ " sns.kdeplot(x=data[col], hue=data['Rating']) # Changed data[col] to x=data[col]\n",
461
+ " plt.legend(data['Rating'].unique())\n",
462
+ " plt.xlabel('')\n",
463
+ " plt.ylabel('')\n",
464
+ "\n",
465
+ "plt.show() # Moved plt.show() outside the loop\n"
466
+ ]
467
+ },
468
+ {
469
+ "cell_type": "code",
470
+ "execution_count": null,
471
+ "metadata": {
472
+ "colab": {
473
+ "base_uri": "https://localhost:8080/",
474
+ "height": 406
475
+ },
476
+ "id": "ngsxYq7B7qmS",
477
+ "outputId": "3f2df19c-4c34-4632-e053-9f32cef24ce0"
478
+ },
479
+ "outputs": [],
480
+ "source": [
481
+ "features = data.columns.tolist()[2:]\n",
482
+ "for feature in features:\n",
483
+ " visualize(feature)"
484
+ ]
485
+ },
486
+ {
487
+ "cell_type": "markdown",
488
+ "metadata": {
489
+ "id": "d1We3wN-7qmS"
490
+ },
491
+ "source": [
492
+ "## Term Frequency Analysis\n",
493
+ "Examining the most frequently occuring words is one of the most popular systems of Text analytics. For example, in a sentiment analysis problem, a positive text is bound to have words like 'good', 'great', 'nice', etc. more in number than other words that imply otherwise.\n",
494
+ "\n",
495
+ "*Note*: Term Frequencies are more than counts and lenghts, so the first requirement is to preprocess the text"
496
+ ]
497
+ },
498
+ {
499
+ "cell_type": "code",
500
+ "execution_count": null,
501
+ "metadata": {
502
+ "colab": {
503
+ "base_uri": "https://localhost:8080/",
504
+ "height": 206
505
+ },
506
+ "id": "X5hNJsve7qmS",
507
+ "outputId": "ce658f9e-c17c-4209-a422-6a401d3779c4"
508
+ },
509
+ "outputs": [],
510
+ "source": [
511
+ "df = data.drop(features, axis=1)\n",
512
+ "df.head()"
513
+ ]
514
+ },
515
+ {
516
+ "cell_type": "code",
517
+ "execution_count": null,
518
+ "metadata": {
519
+ "colab": {
520
+ "base_uri": "https://localhost:8080/"
521
+ },
522
+ "id": "mUt47IDG7qmS",
523
+ "outputId": "914d80d2-c511-4a3a-eba8-780af3985eec"
524
+ },
525
+ "outputs": [],
526
+ "source": [
527
+ "df.info()"
528
+ ]
529
+ },
530
+ {
531
+ "cell_type": "markdown",
532
+ "metadata": {
533
+ "id": "FJ4S44-N7qmS"
534
+ },
535
+ "source": [
536
+ "There is no missing data, therefore, we can move to the next stage. For Term frequency analysis, it is essential that the text data be preprocessed.\n",
537
+ "* Lowercase\n",
538
+ "* Remove punctutations\n",
539
+ "* Stopword removal"
540
+ ]
541
+ },
542
+ {
543
+ "cell_type": "code",
544
+ "execution_count": null,
545
+ "metadata": {
546
+ "id": "9AevOzM77qmS"
547
+ },
548
+ "outputs": [],
549
+ "source": [
550
+ "def clean(review):\n",
551
+ "\n",
552
+ " review = review.lower()\n",
553
+ " review = re.sub('[^a-z A-Z 0-9-]+', '', review)\n",
554
+ " review = \" \".join([word for word in review.split() if word not in stopwords.words('english')])\n",
555
+ "\n",
556
+ " return review"
557
+ ]
558
+ },
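As a quick check of what `clean` actually does, a sketch on a hypothetical input (the expected output under these rules is shown as a comment):

```python
sample = "The hotel was AMAZING, but the Wi-Fi didn't work!"  # hypothetical review
print(clean(sample))
# hotel amazing wi-fi didnt work  <- lowercased, punctuation stripped, stopwords removed
```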
559
+ {
560
+ "cell_type": "code",
561
+ "execution_count": null,
562
+ "metadata": {
563
+ "_kg_hide-output": true,
564
+ "colab": {
565
+ "base_uri": "https://localhost:8080/",
566
+ "height": 398
567
+ },
568
+ "id": "1GyDUkdZ7qmT",
569
+ "outputId": "6cf88ffa-e236-4e08-c079-32313385e4de"
570
+ },
571
+ "outputs": [],
572
+ "source": [
573
+ " import nltk\n",
574
+ " nltk.download('stopwords')\n",
575
+ "df['Review'] = df['Review'].apply(clean)\n",
576
+ "df.head(10)\n",
577
+ "# Convert Text to Lowercase\n",
578
+ "# Convert Text to Lowercase\n",
579
+ "#Remove Stopwords\n",
580
+ "#tokenization"
581
+ ]
582
+ },
583
+ {
584
+ "cell_type": "code",
585
+ "execution_count": null,
586
+ "metadata": {
587
+ "colab": {
588
+ "base_uri": "https://localhost:8080/",
589
+ "height": 122
590
+ },
591
+ "id": "pwSmwu747qmT",
592
+ "outputId": "f1d16f69-13ff-4d5c-b443-034e12d1ba6d"
593
+ },
594
+ "outputs": [],
595
+ "source": [
596
+ "df['Review'][0]"
597
+ ]
598
+ },
599
+ {
600
+ "cell_type": "code",
601
+ "execution_count": null,
602
+ "metadata": {
603
+ "id": "84MJDfv37qmT"
604
+ },
605
+ "outputs": [],
606
+ "source": [
607
+ "def corpus(text):\n",
608
+ " text_list = text.split()\n",
609
+ " return text_list"
610
+ ]
611
+ },
612
+ {
613
+ "cell_type": "code",
614
+ "execution_count": null,
615
+ "metadata": {
616
+ "_kg_hide-output": true,
617
+ "colab": {
618
+ "base_uri": "https://localhost:8080/",
619
+ "height": 502
620
+ },
621
+ "id": "w4PhDhS67qmT",
622
+ "outputId": "2be2b5de-f793-425d-cb1b-faf5c293b652"
623
+ },
624
+ "outputs": [],
625
+ "source": [
626
+ "df['Review_lists'] = df['Review'].apply(corpus)\n",
627
+ "df.head(10)"
628
+ ]
629
+ },
630
+ {
631
+ "cell_type": "code",
632
+ "execution_count": null,
633
+ "metadata": {
634
+ "colab": {
635
+ "base_uri": "https://localhost:8080/"
636
+ },
637
+ "id": "LBFaC6gO7qmT",
638
+ "outputId": "11e0cb59-379b-4cf8-d358-9c574575cc98"
639
+ },
640
+ "outputs": [],
641
+ "source": [
642
+ "corpus = []\n",
643
+ "for i in trange(df.shape[0], ncols=150, nrows=10, colour='green', smoothing=0.8):\n",
644
+ " corpus += df['Review_lists'][i]\n",
645
+ "len(corpus) #append all elements from the Review_lists column into corpus"
646
+ ]
647
+ },
648
+ {
649
+ "cell_type": "code",
650
+ "execution_count": null,
651
+ "metadata": {
652
+ "colab": {
653
+ "base_uri": "https://localhost:8080/"
654
+ },
655
+ "id": "J8nzbqaH7qmT",
656
+ "outputId": "73f954fd-c12a-4fc7-fee5-52e3355f1212"
657
+ },
658
+ "outputs": [],
659
+ "source": [
660
+ "mostCommon = Counter(corpus).most_common(10)\n",
661
+ "mostCommon"
662
+ ]
663
+ },
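`FreqDist` was imported in the setup cell but never used; it is nltk's counterpart to `Counter` and produces the same ranking, with a few NLP conveniences on top. A sketch of the equivalent call, assuming `corpus` is the token list built above:

```python
from nltk.probability import FreqDist

fdist = FreqDist(corpus)      # same token list as above
print(fdist.most_common(10))  # matches Counter(corpus).most_common(10)
fdist.plot(10)                # optional built-in rank/frequency plot
```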
664
+ {
665
+ "cell_type": "code",
666
+ "execution_count": null,
667
+ "metadata": {
668
+ "id": "K6ODVB4-7qmT"
669
+ },
670
+ "outputs": [],
671
+ "source": [
672
+ "words = []\n",
673
+ "freq = []\n",
674
+ "for word, count in mostCommon:\n",
675
+ " words.append(word)\n",
676
+ " freq.append(count)"
677
+ ]
678
+ },
679
+ {
680
+ "cell_type": "code",
681
+ "execution_count": null,
682
+ "metadata": {
683
+ "colab": {
684
+ "base_uri": "https://localhost:8080/",
685
+ "height": 340
686
+ },
687
+ "id": "Ir6FuvfM7qmU",
688
+ "outputId": "b8590f9b-ac31-4b67-e27d-57f7a98f8dff"
689
+ },
690
+ "outputs": [],
691
+ "source": [
692
+ "sns.barplot(x=freq, y=words)\n",
693
+ "plt.title('Top 10 Most Frequently Occuring Words')\n",
694
+ "plt.show()"
695
+ ]
696
+ },
697
+ {
698
+ "cell_type": "markdown",
699
+ "metadata": {
700
+ "id": "rYeCoTCB7qmU"
701
+ },
702
+ "source": [
703
+ "## Most Frequently occuring N_grams\n",
704
+ "\n",
705
+ "**What is an N-gram?** <br>\n",
706
+ "An n-gram is sequence of n words in a text. Most words by themselves may not present the entire context. Typically adverbs such as 'most' or 'very' are used to modify verbs and adjectives. Therefore, n-grams help analyse phrases and not just words which can lead to better insights.\n",
707
+ "<br>\n",
708
+ "> A **Bi-gram** means two words in a sequence. 'Very good' or 'Too great'<br>\n",
709
+ "> A **Tri-gram** means three words in a sequence. 'How was your day' would be broken down to 'How was your' and 'was your day'.<br>\n",
710
+ "\n",
711
+ "For separating text into n-grams, we will use `CountVectorizer` from Sklearn"
712
+ ]
713
+ },
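To make the `ngram_range` mechanics concrete before running them on the full corpus, a minimal sketch on two hypothetical sentences:

```python
from sklearn.feature_extraction.text import CountVectorizer

toy = ["very good hotel", "very good price"]
cv_demo = CountVectorizer(ngram_range=(2, 2))  # bigrams only
counts = cv_demo.fit_transform(toy)

print(cv_demo.get_feature_names_out())  # ['good hotel' 'good price' 'very good']
print(counts.toarray().sum(axis=0))     # [1 1 2] -> 'very good' appears twice
```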
714
+ {
715
+ "cell_type": "code",
716
+ "execution_count": null,
717
+ "metadata": {
718
+ "id": "z1A_JYd07qmU"
719
+ },
720
+ "outputs": [],
721
+ "source": [
722
+ "cv = CountVectorizer(ngram_range=(2,2))\n",
723
+ "bigrams = cv.fit_transform(df['Review'])"
724
+ ]
725
+ },
726
+ {
727
+ "cell_type": "code",
728
+ "execution_count": null,
729
+ "metadata": {
730
+ "id": "qds7d7lx7qmd"
731
+ },
732
+ "outputs": [],
733
+ "source": [
734
+ "count_values = bigrams.toarray().sum(axis=0)\n",
735
+ "ngram_freq = pd.DataFrame(sorted([(count_values[i], k) for k, i in cv.vocabulary_.items()], reverse = True))\n",
736
+ "ngram_freq.columns = [\"frequency\", \"ngram\"]"
737
+ ]
738
+ },
739
+ {
740
+ "cell_type": "code",
741
+ "execution_count": null,
742
+ "metadata": {
743
+ "colab": {
744
+ "base_uri": "https://localhost:8080/",
745
+ "height": 324
746
+ },
747
+ "id": "u4FTk_rI7qmd",
748
+ "outputId": "ace83676-57ba-43b6-a8e4-094cd98ffdc6"
749
+ },
750
+ "outputs": [],
751
+ "source": [
752
+ "sns.barplot(x=ngram_freq['frequency'][:10], y=ngram_freq['ngram'][:10])\n",
753
+ "plt.title('Top 10 Most Frequently Occuring Bigrams')\n",
754
+ "plt.show()"
755
+ ]
756
+ },
757
+ {
758
+ "cell_type": "code",
759
+ "execution_count": null,
760
+ "metadata": {
761
+ "colab": {
762
+ "base_uri": "https://localhost:8080/",
763
+ "height": 373
764
+ },
765
+ "id": "y1iBCNhi7qme",
766
+ "outputId": "6a4fe357-27d4-4e00-82e1-7e23fc62c279"
767
+ },
768
+ "outputs": [],
769
+ "source": [
770
+ "cv1 = CountVectorizer(ngram_range=(3,3))\n",
771
+ "trigrams = cv1.fit_transform(df['Review'])\n",
772
+ "count_values = trigrams.toarray().sum(axis=0)\n",
773
+ "ngram_freq = pd.DataFrame(sorted([(count_values[i], k) for k, i in cv1.vocabulary_.items()], reverse = True))\n",
774
+ "ngram_freq.columns = [\"frequency\", \"ngram\"]"
775
+ ]
776
+ },
777
+ {
778
+ "cell_type": "code",
779
+ "execution_count": null,
780
+ "metadata": {
781
+ "id": "XOcm6yFw7qme"
782
+ },
783
+ "outputs": [],
784
+ "source": [
785
+ "sns.barplot(x=ngram_freq['frequency'][:10], y=ngram_freq['ngram'][:10])\n",
786
+ "plt.title('Top 10 Most Frequently Occuring Trigrams')\n",
787
+ "plt.show()"
788
+ ]
789
+ },
790
+ {
791
+ "cell_type": "markdown",
792
+ "metadata": {
793
+ "id": "y-bWL2ce7qme"
794
+ },
795
+ "source": [
796
+ "<div class=\"alert alert-info\" role=\"alert\">\n",
797
+ " <h2>But what about Word Clouds?</h2>\n",
798
+ "\n",
799
+ "<p>\n",
800
+ " While word clouds are very appealing, they really don't provide a lot of information. A word or two are very obviously visible but other than that, there is not a lot to examine. <b>A simple bar plot may not be as attractive as a word cloud but it is surely more informative</b> - which is our ultimate goal. A word cloud may serve better as a cover to present your solution (which is why its right on top), but it can hardly be the solution. Of course, this is my personal opinion and word clouds should be used if they're absolutely needed. <br><br>\n",
801
+ " What do you think? Let me know in the comments!</p>\n",
802
+ "</div>"
803
+ ]
804
+ }
805
+ ],
806
+ "metadata": {
807
+ "colab": {
808
+ "provenance": []
809
+ },
810
+ "kaggle": {
811
+ "accelerator": "none",
812
+ "dataSources": [
813
+ {
814
+ "datasetId": 897156,
815
+ "sourceId": 1526618,
816
+ "sourceType": "datasetVersion"
817
+ }
818
+ ],
819
+ "dockerImageVersionId": 30260,
820
+ "isGpuEnabled": false,
821
+ "isInternetEnabled": true,
822
+ "language": "python",
823
+ "sourceType": "notebook"
824
+ },
825
+ "kernelspec": {
826
+ "display_name": "Python 3 (ipykernel)",
827
+ "language": "python",
828
+ "name": "python3"
829
+ },
830
+ "language_info": {
831
+ "codemirror_mode": {
832
+ "name": "ipython",
833
+ "version": 3
834
+ },
835
+ "file_extension": ".py",
836
+ "mimetype": "text/x-python",
837
+ "name": "python",
838
+ "nbconvert_exporter": "python",
839
+ "pygments_lexer": "ipython3",
840
+ "version": "3.12.4"
841
+ }
842
+ },
843
+ "nbformat": 4,
844
+ "nbformat_minor": 4
845
+ }