RubyGems - j1-template - Versions diffs - 2022.2.2 → 2022.2.3 - Mend

j1-template 2022.2.2 → 2022.2.3

Files changed (62)

data/lib/starter_web/pages/public/jupyter/notebooks/nbi-docs/.ipynb_checkpoints/nbi_docs_examples_correlation-checkpoint.ipynb DELETED

@@ -1,651 +0,0 @@
-{
- "cells": [
-  {
-   "cell_type": "code",
-   "execution_count": 1,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "# HIDDEN\n",
-    "from datascience import *\n",
-    "%matplotlib inline\n",
-    "import matplotlib.pyplot as plots\n",
-    "plots.style.use('fivethirtyeight')\n",
-    "import math\n",
-    "import numpy as np\n",
-    "from scipy import stats\n",
-    "from ipywidgets import interact, interactive, fixed, interact_manual\n",
-    "import ipywidgets as widgets\n",
-    "import nbinteract as nbi"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "### Correlation\n",
-    "In this section we will develop a measure of how tightly clustered a scatter diagram is about a straight line. Formally, this is called measuring *linear association*."
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "#### The correlation coefficient\n",
-    "\n",
-    "The *correlation coefficient* measures the strength of the linear relationship between two variables. Graphically, it measures how clustered the scatter diagram is around a straight line.\n",
-    "\n",
-    "Because *correlation coefficient* is a long term, it is usually shortened to *correlation* and denoted by $r$.\n",
-    "\n",
-    "Here are some mathematical facts about $r$ that we will just observe by simulation.\n",
-    "\n",
-    "- The correlation coefficient $r$ is a number between -1 and 1.\n",
-    "- $r$ measures the extent to which the scatter plot clusters around a straight line.\n",
-    "- $r$ = 1 if the scatter diagram is a perfect straight line sloping upwards, and $r$ = -1 if the scatter diagram is a perfect straight line sloping downwards."
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "The function ``r_scatter`` takes x-values and a value of $r$ as its arguments and generates y-values for a scatter plot whose correlation is very close to $r$. Because of randomness in the simulation, the correlation is not expected to be exactly equal to $r$.\n",
-    "\n",
-    "Call ``r_scatter`` a few times, with different values of $r$ as the argument, and see how the scatter plot changes. \n",
-    "\n",
-    "When $r$ = 1 the scatter plot is perfectly linear and slopes upward. When $r$ = -1, the scatter plot is perfectly linear and slopes downward. When $r$ = 0, the scatter plot is a formless cloud around the horizontal axis, and the variables are said to be *uncorrelated*."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 2,
-   "metadata": {},
-   "outputs": [
-    {
-     "data": {
-      "application/vnd.jupyter.widget-view+json": {
-       "model_id": "d40394e76601418f9eaad00c672f41d1",
-       "version_major": 2,
-       "version_minor": 0
-      },
-      "text/plain": [
-       "VBox(children=(interactive(children=(FloatSlider(value=0.0, description='r', max=1.0, min=-1.0, step=0.05), Ou…"
-      ]
-     },
-     "metadata": {},
-     "output_type": "display_data"
-    }
-   ],
-   "source": [
-    "z = np.random.normal(0, 1, 500)\n",
-    "def r_scatter(xs, r):\n",
-    "    \"\"\"\n",
-    "    Generate y-values for a scatter plot with correlation approximately r\n",
-    "    \"\"\"\n",
-    "    return r*xs + (np.sqrt(1-r**2))*z\n",
-    "\n",
-    "corr_opts = {\n",
-    "    'aspect_ratio': 1,\n",
-    "    'xlim': (-3.5, 3.5),\n",
-    "    'ylim': (-3.5, 3.5),\n",
-    "}\n",
-    "\n",
-    "nbi.scatter(np.random.normal(size=500), r_scatter, options=corr_opts, r=(-1, 1, 0.05))"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "#### Calculating the correlation\n",
-    "\n",
-    "The formula for $r$ is not apparent from our observations so far. It has a mathematical basis that is outside the scope of this class. However, as you will see, the calculation is straightforward and helps us understand several of the properties of $r$.\n",
-    "\n",
-    "**Formula** for $r$:\n",
-    "\n",
-    "$r$ is the **average of the products of the two variables**, when both variables are measured in standard units.\n",
-    "\n",
-    "Here are the steps in the calculation. We will apply the steps to a simple table of values of **x** and **y**."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 3,
-   "metadata": {},
-   "outputs": [
-    {
-     "data": {
-      "text/html": [
-       "<table border=\"1\" class=\"dataframe\">\n",
-       "    <thead>\n",
-       "        <tr>\n",
-       "            <th>x</th> <th>y</th>\n",
-       "        </tr>\n",
-       "    </thead>\n",
-       "    <tbody>\n",
-       "        <tr>\n",
-       "            <td>1   </td> <td>2   </td>\n",
-       "        </tr>\n",
-       "        <tr>\n",
-       "            <td>2   </td> <td>3   </td>\n",
-       "        </tr>\n",
-       "        <tr>\n",
-       "            <td>3   </td> <td>1   </td>\n",
-       "        </tr>\n",
-       "        <tr>\n",
-       "            <td>4   </td> <td>5   </td>\n",
-       "        </tr>\n",
-       "        <tr>\n",
-       "            <td>5   </td> <td>2   </td>\n",
-       "        </tr>\n",
-       "        <tr>\n",
-       "            <td>6   </td> <td>7   </td>\n",
-       "        </tr>\n",
-       "    </tbody>\n",
-       "</table>"
-      ],
-      "text/plain": [
-       "x    | y\n",
-       "1    | 2\n",
-       "2    | 3\n",
-       "3    | 1\n",
-       "4    | 5\n",
-       "5    | 2\n",
-       "6    | 7"
-      ]
-     },
-     "execution_count": 3,
-     "metadata": {},
-     "output_type": "execute_result"
-    }
-   ],
-   "source": [
-    "x = np.arange(1, 7, 1)\n",
-    "y = make_array(2, 3, 1, 5, 2, 7)\n",
-    "t = Table().with_columns(\n",
-    "        'x', x,\n",
-    "        'y', y\n",
-    "    )\n",
-    "t"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "Based on the scatter diagram, we expect that $r$ will be positive but not equal to 1."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 4,
-   "metadata": {},
-   "outputs": [
-    {
-     "data": {
-      "application/vnd.jupyter.widget-view+json": {
-       "model_id": "562201223b3e4e31846c056f1b44b96e",
-       "version_major": 2,
-       "version_minor": 0
-      },
-      "text/plain": [
-       "VBox(children=(interactive(children=(Output(),), _dom_classes=('widget-interact',)), Figure(axes=[Axis(scale=L…"
-      ]
-     },
-     "metadata": {},
-     "output_type": "display_data"
-    }
-   ],
-   "source": [
-    "nbi.scatter(t.column(0), t.column(1), options={'aspect_ratio': 1})"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "**Step 1.** Convert each variable to standard units."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 5,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "def standard_units(nums):\n",
-    "    return (nums - np.mean(nums)) / np.std(nums)"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 6,
-   "metadata": {},
-   "outputs": [
-    {
-     "data": {
-      "text/html": [
-       "<table border=\"1\" class=\"dataframe\">\n",
-       "    <thead>\n",
-       "        <tr>\n",
-       "            <th>x</th> <th>y</th> <th>x (standard units)</th> <th>y (standard units)</th>\n",
-       "        </tr>\n",
-       "    </thead>\n",
-       "    <tbody>\n",
-       "        <tr>\n",
-       "            <td>1   </td> <td>2   </td> <td>-1.46385          </td> <td>-0.648886         </td>\n",
-       "        </tr>\n",
-       "        <tr>\n",
-       "            <td>2   </td> <td>3   </td> <td>-0.87831          </td> <td>-0.162221         </td>\n",
-       "        </tr>\n",
-       "        <tr>\n",
-       "            <td>3   </td> <td>1   </td> <td>-0.29277          </td> <td>-1.13555          </td>\n",
-       "        </tr>\n",
-       "        <tr>\n",
-       "            <td>4   </td> <td>5   </td> <td>0.29277           </td> <td>0.811107          </td>\n",
-       "        </tr>\n",
-       "        <tr>\n",
-       "            <td>5   </td> <td>2   </td> <td>0.87831           </td> <td>-0.648886         </td>\n",
-       "        </tr>\n",
-       "        <tr>\n",
-       "            <td>6   </td> <td>7   </td> <td>1.46385           </td> <td>1.78444           </td>\n",
-       "        </tr>\n",
-       "    </tbody>\n",
-       "</table>"
-      ],
-      "text/plain": [
-       "x    | y    | x (standard units) | y (standard units)\n",
-       "1    | 2    | -1.46385           | -0.648886\n",
-       "2    | 3    | -0.87831           | -0.162221\n",
-       "3    | 1    | -0.29277           | -1.13555\n",
-       "4    | 5    | 0.29277            | 0.811107\n",
-       "5    | 2    | 0.87831            | -0.648886\n",
-       "6    | 7    | 1.46385            | 1.78444"
-      ]
-     },
-     "execution_count": 6,
-     "metadata": {},
-     "output_type": "execute_result"
-    }
-   ],
-   "source": [
-    "t_su = t.with_columns(\n",
-    "        'x (standard units)', standard_units(x),\n",
-    "        'y (standard units)', standard_units(y)\n",
-    "    )\n",
-    "t_su"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "**Step 2.** Multiply each pair of standard units."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 7,
-   "metadata": {},
-   "outputs": [
-    {
-     "data": {
-      "text/html": [
-       "<table border=\"1\" class=\"dataframe\">\n",
-       "    <thead>\n",
-       "        <tr>\n",
-       "            <th>x</th> <th>y</th> <th>x (standard units)</th> <th>y (standard units)</th> <th>product of standard units</th>\n",
-       "        </tr>\n",
-       "    </thead>\n",
-       "    <tbody>\n",
-       "        <tr>\n",
-       "            <td>1   </td> <td>2   </td> <td>-1.46385          </td> <td>-0.648886         </td> <td>0.949871                 </td>\n",
-       "        </tr>\n",
-       "        <tr>\n",
-       "            <td>2   </td> <td>3   </td> <td>-0.87831          </td> <td>-0.162221         </td> <td>0.142481                 </td>\n",
-       "        </tr>\n",
-       "        <tr>\n",
-       "            <td>3   </td> <td>1   </td> <td>-0.29277          </td> <td>-1.13555          </td> <td>0.332455                 </td>\n",
-       "        </tr>\n",
-       "        <tr>\n",
-       "            <td>4   </td> <td>5   </td> <td>0.29277           </td> <td>0.811107          </td> <td>0.237468                 </td>\n",
-       "        </tr>\n",
-       "        <tr>\n",
-       "            <td>5   </td> <td>2   </td> <td>0.87831           </td> <td>-0.648886         </td> <td>-0.569923                </td>\n",
-       "        </tr>\n",
-       "        <tr>\n",
-       "            <td>6   </td> <td>7   </td> <td>1.46385           </td> <td>1.78444           </td> <td>2.61215                  </td>\n",
-       "        </tr>\n",
-       "    </tbody>\n",
-       "</table>"
-      ],
-      "text/plain": [
-       "x    | y    | x (standard units) | y (standard units) | product of standard units\n",
-       "1    | 2    | -1.46385           | -0.648886          | 0.949871\n",
-       "2    | 3    | -0.87831           | -0.162221          | 0.142481\n",
-       "3    | 1    | -0.29277           | -1.13555           | 0.332455\n",
-       "4    | 5    | 0.29277            | 0.811107           | 0.237468\n",
-       "5    | 2    | 0.87831            | -0.648886          | -0.569923\n",
-       "6    | 7    | 1.46385            | 1.78444            | 2.61215"
-      ]
-     },
-     "execution_count": 7,
-     "metadata": {},
-     "output_type": "execute_result"
-    }
-   ],
-   "source": [
-    "t_product = t_su.with_column('product of standard units', t_su.column(2) * t_su.column(3))\n",
-    "t_product"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "**Step 3.** $r$ is the average of the products computed in Step 2."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 8,
-   "metadata": {},
-   "outputs": [
-    {
-     "data": {
-      "text/plain": [
-       "0.6174163971897709"
-      ]
-     },
-     "execution_count": 8,
-     "metadata": {},
-     "output_type": "execute_result"
-    }
-   ],
-   "source": [
-    "# r is the average of the products of standard units\n",
-    "\n",
-    "r = np.mean(t_product.column(4))\n",
-    "r"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "As expected, $r$ is positive but not equal to 1."
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "#### Properties of $r$\n",
-    "\n",
-    "The calculation shows that:\n",
-    "\n",
-    "- $r$ is a pure number. It has no units. This is because $r$ is based on standard units.\n",
-    "- $r$ is unaffected by changing the units on either axis. This too is because $r$ is based on standard units.\n",
-    "- $r$ is unaffected by switching the axes. Algebraically, this is because the product of standard units does not depend on which variable is called **x** and which **y**. Geometrically, switching axes reflects the scatter plot about the line **y = x**, but changes neither the amount of clustering nor the sign of the association."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 9,
-   "metadata": {},
-   "outputs": [
-    {
-     "data": {
-      "application/vnd.jupyter.widget-view+json": {
-       "model_id": "7373845f9f4c4826a8b60afd4df3cbc6",
-       "version_major": 2,
-       "version_minor": 0
-      },
-      "text/plain": [
-       "VBox(children=(interactive(children=(Output(),), _dom_classes=('widget-interact',)), Figure(axes=[Axis(scale=L…"
-      ]
-     },
-     "metadata": {},
-     "output_type": "display_data"
-    }
-   ],
-   "source": [
-    "nbi.scatter(t.column(1), t.column(0), options={'aspect_ratio': 1})"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "#### The correlation function\n",
-    "We are going to be calculating correlations repeatedly, so it will help to define a function that computes it by performing all the steps described above. Let's define a function ``correlation`` that takes a table and the labels of two columns in the table. The function returns $r$, the mean of the products of those column values in standard units."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 10,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "def correlation(t, x, y):\n",
-    "    return np.mean(standard_units(t.column(x))*standard_units(t.column(y)))"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 11,
-   "metadata": {},
-   "outputs": [
-    {
-     "data": {
-      "application/vnd.jupyter.widget-view+json": {
-       "model_id": "a229a4df59044f0887272fa2fa7cf8d2",
-       "version_major": 2,
-       "version_minor": 0
-      },
-      "text/plain": [
-       "interactive(children=(ToggleButtons(description='x-axis', options=('x', 'y'), value='x'), ToggleButtons(descri…"
-      ]
-     },
-     "metadata": {},
-     "output_type": "display_data"
-    },
-    {
-     "data": {
-      "text/plain": [
-       "<function __main__.correlation(t, x, y)>"
-      ]
-     },
-     "execution_count": 11,
-     "metadata": {},
-     "output_type": "execute_result"
-    }
-   ],
-   "source": [
-    "interact(correlation, t=fixed(t),\n",
-    "         x=widgets.ToggleButtons(options=['x', 'y'], description='x-axis'),\n",
-    "         y=widgets.ToggleButtons(options=['x', 'y'], description='y-axis'))"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "Let's call the function on the ``x`` and ``y`` columns of ``t``. The function returns the same value for the correlation between $x$ and $y$ as we got by applying the formula for $r$ directly."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 12,
-   "metadata": {},
-   "outputs": [
-    {
-     "data": {
-      "text/plain": [
-       "0.6174163971897709"
-      ]
-     },
-     "execution_count": 12,
-     "metadata": {},
-     "output_type": "execute_result"
-    }
-   ],
-   "source": [
-    "correlation(t, 'x', 'y')"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "As we noticed, the order in which the variables are specified doesn't matter."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 13,
-   "metadata": {},
-   "outputs": [
-    {
-     "data": {
-      "text/plain": [
-       "0.6174163971897709"
-      ]
-     },
-     "execution_count": 13,
-     "metadata": {},
-     "output_type": "execute_result"
-    }
-   ],
-   "source": [
-    "correlation(t, 'y', 'x')"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "Calling ``correlation`` on columns of the table ``suv`` gives us the correlation between price and mileage as well as the correlation between price and acceleration."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 14,
-   "metadata": {},
-   "outputs": [
-    {
-     "data": {
-      "application/vnd.jupyter.widget-view+json": {
-       "model_id": "c82ed7aaa9184cf083050fe0e369211b",
-       "version_major": 2,
-       "version_minor": 0
-      },
-      "text/plain": [
-       "interactive(children=(ToggleButtons(description='x-axis', options=('mpg', 'msrp', 'acceleration'), value='mpg'…"
-      ]
-     },
-     "metadata": {},
-     "output_type": "display_data"
-    },
-    {
-     "data": {
-      "text/plain": [
-       "<function __main__.correlation(t, x, y)>"
-      ]
-     },
-     "execution_count": 14,
-     "metadata": {},
-     "output_type": "execute_result"
-    }
-   ],
-   "source": [
-    "suv = (Table.read_table('https://raw.githubusercontent.com/data-8/materials-fa17/master/lec/hybrid.csv')\n",
-    "       .where('class', 'SUV'))\n",
-    "\n",
-    "interact(correlation, t=fixed(suv),\n",
-    "         x=widgets.ToggleButtons(options=['mpg', 'msrp', 'acceleration'],\n",
-    "                                 description='x-axis'),\n",
-    "         y=widgets.ToggleButtons(options=['mpg', 'msrp', 'acceleration'],\n",
-    "                                 description='y-axis'))"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 15,
-   "metadata": {},
-   "outputs": [
-    {
-     "data": {
-      "text/plain": [
-       "-0.6667143635709919"
-      ]
-     },
-     "execution_count": 15,
-     "metadata": {},
-     "output_type": "execute_result"
-    }
-   ],
-   "source": [
-    "correlation(suv, 'mpg', 'msrp')"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 16,
-   "metadata": {},
-   "outputs": [
-    {
-     "data": {
-      "text/plain": [
-       "0.48699799279959155"
-      ]
-     },
-     "execution_count": 16,
-     "metadata": {},
-     "output_type": "execute_result"
-    }
-   ],
-   "source": [
-    "correlation(suv, 'acceleration', 'msrp')"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "These values confirm what we had observed: \n",
-    "\n",
-    "- There is a negative association between price and mileage, whereas the association between price and acceleration is positive.\n",
-    "- The linear relation between price and acceleration is a little weaker (correlation about 0.5) than between price and mileage (correlation about -0.67). "
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "Correlation is a simple and powerful concept, but it is sometimes misused. Before using $r$, it is important to be aware of what correlation does and does not measure."
-   ]
-  }
- ],
- "metadata": {
-  "anaconda-cloud": {},
-  "kernelspec": {
-   "display_name": "Python 3",
-   "language": "python",
-   "name": "python3"
-  },
-  "language_info": {
-   "codemirror_mode": {
-    "name": "ipython",
-    "version": 3
-   },
-   "file_extension": ".py",
-   "mimetype": "text/x-python",
-   "name": "python",
-   "nbconvert_exporter": "python",
-   "pygments_lexer": "ipython3",
-   "version": "3.7.9"
-  }
- },
- "nbformat": 4,
- "nbformat_minor": 1
-}
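
The three-step calculation described in the deleted notebook (convert to standard units, multiply the pairs, take the mean) can be sketched without the `datascience` and NumPy dependencies. This is a minimal illustration using only Python's standard library, not the notebook's own code. It assumes the population standard deviation (`statistics.pstdev`), which matches the default behavior of `np.std` used in the notebook; the rescaling factors `2.54` and `100` are arbitrary values chosen for the example.

```python
from statistics import fmean, pstdev


def standard_units(nums):
    """Convert values to standard units using the population SD."""
    mean, sd = fmean(nums), pstdev(nums)
    return [(v - mean) / sd for v in nums]


def correlation(xs, ys):
    """Return r: the mean of the products of xs and ys in standard units."""
    products = [a * b for a, b in zip(standard_units(xs), standard_units(ys))]
    return fmean(products)


# Same small table of values as in the notebook.
x = [1, 2, 3, 4, 5, 6]
y = [2, 3, 1, 5, 2, 7]

r = correlation(x, y)

# Switching the axes should not change r.
r_swapped = correlation(y, x)

# Changing units (scaling x, shifting y) should not change r either.
r_rescaled = correlation([2.54 * v for v in x], [v + 100 for v in y])

print(round(r, 6), round(r_swapped, 6), round(r_rescaled, 6))
```

All three printed values should agree with the notebook's result of about 0.6174, which illustrates the listed properties: $r$ does not depend on the units of either variable or on which variable is placed on which axis.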