my-markdown-library 0.1.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- checksums.yaml +7 -0
- data/F24LS_md/ Lecture 4 - Public.md +347 -0
- data/F24LS_md/Lecture 1 - Introduction and Overview.md +327 -0
- data/F24LS_md/Lecture 10 - Development_.md +631 -0
- data/F24LS_md/Lecture 11 - Econometrics.md +345 -0
- data/F24LS_md/Lecture 12 - Finance.md +692 -0
- data/F24LS_md/Lecture 13 - Environmental Economics.md +299 -0
- data/F24LS_md/Lecture 15 - Conclusion.md +272 -0
- data/F24LS_md/Lecture 2 - Demand.md +349 -0
- data/F24LS_md/Lecture 3 - Supply.md +329 -0
- data/F24LS_md/Lecture 5 - Production C-D.md +291 -0
- data/F24LS_md/Lecture 6 - Utility and Latex.md +440 -0
- data/F24LS_md/Lecture 7 - Inequality.md +607 -0
- data/F24LS_md/Lecture 8 - Macroeconomics.md +704 -0
- data/F24LS_md/Lecture 8 - Macro.md +700 -0
- data/F24LS_md/Lecture 9 - Game Theory_.md +436 -0
- data/F24LS_md/summary.yaml +105 -0
- data/F24Lec_MD/LecNB_summary.yaml +206 -0
- data/F24Lec_MD/lec01/lec01.md +267 -0
- data/F24Lec_MD/lec02/Avocados_demand.md +425 -0
- data/F24Lec_MD/lec02/Demand_Steps_24.md +126 -0
- data/F24Lec_MD/lec02/PriceElasticity.md +83 -0
- data/F24Lec_MD/lec02/ScannerData_Beer.md +171 -0
- data/F24Lec_MD/lec02/demand-curve-Fa24.md +213 -0
- data/F24Lec_MD/lec03/3.0-CubicCostCurve.md +239 -0
- data/F24Lec_MD/lec03/3.1-Supply.md +274 -0
- data/F24Lec_MD/lec03/3.2-sympy.md +332 -0
- data/F24Lec_MD/lec03/3.3a-california-energy.md +120 -0
- data/F24Lec_MD/lec03/3.3b-a-really-hot-tuesday.md +121 -0
- data/F24Lec_MD/lec04/lec04-CSfromSurvey-closed.md +335 -0
- data/F24Lec_MD/lec04/lec04-CSfromSurvey.md +331 -0
- data/F24Lec_MD/lec04/lec04-Supply-Demand-closed.md +519 -0
- data/F24Lec_MD/lec04/lec04-Supply-Demand.md +514 -0
- data/F24Lec_MD/lec04/lec04-four-plot-24.md +34 -0
- data/F24Lec_MD/lec04/lec04-four-plot.md +34 -0
- data/F24Lec_MD/lec05/Lec5-Cobb-Douglas.md +131 -0
- data/F24Lec_MD/lec05/Lec5-CobbD-AER1928.md +283 -0
- data/F24Lec_MD/lec06/6.1-Sympy-Differentiation.md +253 -0
- data/F24Lec_MD/lec06/6.2-3D-utility.md +287 -0
- data/F24Lec_MD/lec06/6.3-QuantEcon-Optimization.md +399 -0
- data/F24Lec_MD/lec06/6.4-latex.md +138 -0
- data/F24Lec_MD/lec06/6.5-Edgeworth.md +269 -0
- data/F24Lec_MD/lec07/7.1-inequality.md +283 -0
- data/F24Lec_MD/lec07/7.2-historical-inequality.md +237 -0
- data/F24Lec_MD/lec08/macro-fred-api.md +313 -0
- data/F24Lec_MD/lec09/lecNB-prisoners-dilemma.md +88 -0
- data/F24Lec_MD/lec10/Lec10.2-waterguard.md +401 -0
- data/F24Lec_MD/lec10/lec10.1-mapping.md +199 -0
- data/F24Lec_MD/lec11/11.1-slr.md +305 -0
- data/F24Lec_MD/lec11/11.2-mlr.md +171 -0
- data/F24Lec_MD/lec12/Lec12-4-PersonalFinance.md +590 -0
- data/F24Lec_MD/lec12/lec12-1_Interest_Payments.md +267 -0
- data/F24Lec_MD/lec12/lec12-2-stocks-options.md +235 -0
- data/F24Lec_MD/lec13/Co2_ClimateChange.md +139 -0
- data/F24Lec_MD/lec13/ConstructingMAC.md +213 -0
- data/F24Lec_MD/lec13/EmissionsTracker.md +170 -0
- data/F24Lec_MD/lec13/KuznetsHypothesis.md +219 -0
- data/F24Lec_MD/lec13/RoslingPlots.md +217 -0
- data/F24Lec_MD/lec15/vibecession.md +485 -0
- data/F24Textbook_MD/00-intro/index.md +292 -0
- data/F24Textbook_MD/01-demand/01-demand.md +152 -0
- data/F24Textbook_MD/01-demand/02-example.md +131 -0
- data/F24Textbook_MD/01-demand/03-log-log.md +284 -0
- data/F24Textbook_MD/01-demand/04-elasticity.md +248 -0
- data/F24Textbook_MD/01-demand/index.md +15 -0
- data/F24Textbook_MD/02-supply/01-supply.md +203 -0
- data/F24Textbook_MD/02-supply/02-eep147-example.md +86 -0
- data/F24Textbook_MD/02-supply/03-sympy.md +138 -0
- data/F24Textbook_MD/02-supply/04-market-equilibria.md +204 -0
- data/F24Textbook_MD/02-supply/index.md +16 -0
- data/F24Textbook_MD/03-public/govt-intervention.md +73 -0
- data/F24Textbook_MD/03-public/index.md +10 -0
- data/F24Textbook_MD/03-public/surplus.md +351 -0
- data/F24Textbook_MD/03-public/taxes-subsidies.md +282 -0
- data/F24Textbook_MD/04-production/index.md +15 -0
- data/F24Textbook_MD/04-production/production.md +178 -0
- data/F24Textbook_MD/04-production/shifts.md +296 -0
- data/F24Textbook_MD/05-utility/budget-constraints.md +166 -0
- data/F24Textbook_MD/05-utility/index.md +15 -0
- data/F24Textbook_MD/05-utility/utility.md +136 -0
- data/F24Textbook_MD/06-inequality/historical-inequality.md +253 -0
- data/F24Textbook_MD/06-inequality/index.md +15 -0
- data/F24Textbook_MD/06-inequality/inequality.md +226 -0
- data/F24Textbook_MD/07-game-theory/bertrand.md +257 -0
- data/F24Textbook_MD/07-game-theory/cournot.md +333 -0
- data/F24Textbook_MD/07-game-theory/equilibria-oligopolies.md +96 -0
- data/F24Textbook_MD/07-game-theory/expected-utility.md +61 -0
- data/F24Textbook_MD/07-game-theory/index.md +19 -0
- data/F24Textbook_MD/07-game-theory/python-classes.md +340 -0
- data/F24Textbook_MD/08-development/index.md +35 -0
- data/F24Textbook_MD/09-macro/CentralBanks.md +101 -0
- data/F24Textbook_MD/09-macro/Indicators.md +77 -0
- data/F24Textbook_MD/09-macro/fiscal_policy.md +36 -0
- data/F24Textbook_MD/09-macro/index.md +14 -0
- data/F24Textbook_MD/09-macro/is_curve.md +76 -0
- data/F24Textbook_MD/09-macro/phillips_curve.md +70 -0
- data/F24Textbook_MD/10-finance/index.md +10 -0
- data/F24Textbook_MD/10-finance/options.md +178 -0
- data/F24Textbook_MD/10-finance/value-interest.md +60 -0
- data/F24Textbook_MD/11-econometrics/index.md +16 -0
- data/F24Textbook_MD/11-econometrics/multivariable.md +218 -0
- data/F24Textbook_MD/11-econometrics/reading-econ-papers.md +25 -0
- data/F24Textbook_MD/11-econometrics/single-variable.md +483 -0
- data/F24Textbook_MD/11-econometrics/statsmodels.md +58 -0
- data/F24Textbook_MD/12-environmental/KuznetsHypothesis-Copy1.md +187 -0
- data/F24Textbook_MD/12-environmental/KuznetsHypothesis.md +187 -0
- data/F24Textbook_MD/12-environmental/MAC.md +254 -0
- data/F24Textbook_MD/12-environmental/index.md +36 -0
- data/F24Textbook_MD/LICENSE.md +11 -0
- data/F24Textbook_MD/intro.md +26 -0
- data/F24Textbook_MD/references.md +25 -0
- data/F24Textbook_MD/summary.yaml +414 -0
- metadata +155 -0
@@ -0,0 +1,483 @@
---
title: single-variable
type: textbook
source_path: content/11-econometrics/single-variable.ipynb
chapter: 11
---

# Single Variable Regression

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
import warnings
warnings.simplefilter("ignore")
from mpl_toolkits.mplot3d import Axes3D
from matplotlib import patches
from datascience import *
%matplotlib inline
from sklearn.metrics import mean_squared_error
from ipywidgets import interact, interactive, fixed
import ipywidgets as widgets
```

Suppose we have some data that looks like this on a scatter plot:



In Data 8 we learned that $x$ and $y$ above have some **correlation coefficient** $r$, which is a measure of the strength of the linear relationship between the two variables.

It looks like there is some positive linear association between $x$ and $y$ such that larger values of $x$ correspond to larger values of $y$. We therefore expect $r$ to be some positive number between 0 and 1, but not exactly 0 or exactly 1.

First, let's convert the data (stored in the arrays `x` and `y`) to standard units. To convert a set of data points to standard units, we subtract the mean of the data and scale by the standard deviation. This transforms the data so that, in standard units, they have mean 0 and standard deviation 1. Below we construct a function that does this.

$$
x_{su} = \dfrac{x - \mu_x}{\sigma_x}
$$

```python
def standard_units(array):
    return (array - np.mean(array)) / np.std(array)
```

```python
# Generate the example data.
np.random.seed(42)
x = np.random.uniform(0, 10, 100)
noise = np.random.randn(100) * 4
y = 1.5 * x + noise
```

```python
x_standard = standard_units(x)
y_standard = standard_units(y)
```
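
As a quick sanity check (a minimal sketch, assuming the arrays defined above), data in standard units should have mean roughly 0 and standard deviation roughly 1:

```python
# Standardized data should have mean ~0 and standard deviation ~1.
print(np.mean(x_standard), np.std(x_standard))
print(np.mean(y_standard), np.std(y_standard))
```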

```python
plt.figure(figsize=(8,6))
plt.scatter(x_standard, y_standard)
plt.xlabel('$x$_standard', fontsize = 14)
plt.ylabel('$y$_standard', fontsize = 14);
```

The scatter plot looks the same as before, except now the axes are scaled such that we measure $x$ and $y$ in standard units. Now recall that $r$ is calculated as the average of the product of two variables, when the variables are measured in standard units. Below we define a function that calculates $r$, assuming that the inputs have already been converted to standard units.

```python
def correlation(array1, array2):
    return np.mean(array1 * array2)
```

What is the correlation between these two variables?

```python
correlation(x_standard, y_standard)
```

Recall from Data 8 that we use $r$ to form a line called the *regression line*, which makes predictions for $y$ given some $x$. Our prediction for $y$ in standard units is $r \cdot x$. If we want to fit this regression line in the original units, recall that the slope of this line is given by

$$
\text{slope} = r \cdot \dfrac{\hat{\sigma}_y}{\hat{\sigma}_x}
$$

and the intercept is given by

$$
\text{intercept} = \hat{\mu}_y - \text{slope} \cdot \hat{\mu}_x
$$

where $\hat{\sigma}_x$ is the observed standard deviation of a variable $x$ and $\hat{\mu}_x$ is the observed mean. Our regression line will have the form

$$
y = \hat{\alpha} + \hat{\beta} x
$$

where $\hat{\alpha}$ is the intercept from above and $\hat{\beta}$ the slope.

Below we plot this line on the same scatter plot from before.

```python
r = correlation(x_standard, y_standard)
slope = r * np.std(y) / np.std(x)
intercept = np.mean(y) - slope * np.mean(x)
```

```python
plt.figure(figsize=(8,6))
plt.scatter(x, y)
plt.plot(np.linspace(0, 10), slope * np.linspace(0, 10) + intercept, color='tab:orange')
plt.xlabel('$x$', fontsize = 14)
plt.ylabel('$y$', fontsize = 14);
```

Let's take a closer look at the slope we found.

```python
print('Slope: ', slope)
```

To generate the data above, we started with some range of $x$ values and generated $y$ as a linear function of $x$ with some random noise added in. Take a look:

```python
np.random.seed(42)
x = np.random.uniform(0, 10, 100)
noise = np.random.randn(100) * 4
y = 1.5 * x + noise
```

Notice how we defined $y$:

$$
y = 1.5 \cdot x + u
$$

where $u$ is some normally-distributed random noise whose average is 0. So, while there is some randomness in the data, on average the "true" slope of the relationship is 1.5. Yet we estimated it to be roughly 1.3!

This highlights the following fact: suppose we have some random data that we believe has a linear relationship. The least-squares slope we compute from the data is an *estimate* of the "true" slope of that relationship. Because of this, the estimated slope is a random variable that depends on the data we happen to have.

To highlight this fact, let's repeat the procedure above but with a different [random seed](https://en.wikipedia.org/wiki/Random_seed), in order to get data with the same underlying relationship but different values.

```python
np.random.seed(189)
x = np.random.uniform(0, 10, 100)
noise = np.random.randn(100) * 4
y = 1.5 * x + noise

# Re-standardize the new sample before recomputing r.
x_standard = standard_units(x)
y_standard = standard_units(y)

r = correlation(x_standard, y_standard)
slope = r * np.std(y) / np.std(x)
intercept = np.mean(y) - slope * np.mean(x)
```

```python
plt.figure(figsize=(8,6))
plt.scatter(x, y)
plt.plot(np.linspace(0, 10), slope * np.linspace(0, 10) + intercept, color='tab:orange')
plt.xlabel('$x$', fontsize = 14)
plt.ylabel('$y$', fontsize = 14);
print('Slope: ', slope)
```

Now the estimated slope is roughly 1.6, even though the underlying data were still generated using a slope of 1.5. This is a very important concept that we will revisit soon.

Keep in mind, however, that correlation in data *does not* imply causation. In this example we know the true causal relationship between $x$ and $y$ because we defined it ourselves. With real data, however, you do not see the "true" relationship and thus cannot conclude causality from correlation. It could simply be that both of your variables depend on an unseen third variable and have no causal effect on one another. Or, while unlikely, a slight linear trend between two variables could even be a complete coincidence.

## Root-Mean-Squared Error

While we could pick $\hat{\alpha}$ and $\hat{\beta}$ arbitrarily, we want to pick the values whose predictions $\hat{y}$ are closest to the actual $y$ values. To achieve this, we minimize a **loss function** that quantifies how far our prediction $\hat{y}$ is from $y$ for some known data points. One of the most common loss functions is the **root-mean-squared error**, defined as

$$
\text{RMSE} = \sqrt{ \frac{1}{n} \sum_{i=1}^n \left ( y_i - \hat{y}_i \right ) ^2 }
$$

where $n$ is the number of observations. The effect is to average the squared distance between each prediction $\hat{y}_i$ and its corresponding value $y_i$; squaring keeps each term positive, and taking the square root at the end puts the error back into the units of $y$.

Plugging the formula for $\hat{y}$ into the RMSE formula, we get

$$
\text{RMSE} = \sqrt{ \frac{1}{n} \sum_{i=1}^n \left ( y_i - (\hat{\alpha} + \hat{\beta}x_i) \right ) ^2 }
$$

By doing a bit of calculus, we get the following formulas for $\hat{\alpha}$ and $\hat{\beta}$:

$$\Large
\hat{\beta} = r\frac {\hat{\sigma}_y} {\hat{\sigma}_x} \qquad \qquad
\hat{\alpha} = \hat{\mu}_y - \hat{\beta}\hat{\mu}_x
$$

where $r$ is the **correlation** between $x$ and $y$, $\hat{\sigma}_y$ is the standard deviation of $y$, $\hat{\sigma}_x$ is the standard deviation of $x$, $\hat{\mu}_y$ is the average of all our $y$ values, and $\hat{\mu}_x$ is the average of all our $x$ values. (As an aside, note the hats on our $\sigma$'s and $\mu$'s; this is because these are _empirical estimates_ of the parameters of these distributions, rather than the true values.) These are the same values we had above!
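
As a quick cross-check (a sketch that is not part of the original notebook), NumPy's `np.polyfit` minimizes the same squared error, so a degree-1 fit should recover essentially the same slope and intercept:

```python
# np.polyfit returns coefficients from highest degree to lowest: [slope, intercept].
beta_hat, alpha_hat = np.polyfit(x, y, 1)
print('polyfit:      ', beta_hat, alpha_hat)
print('our formulas: ', slope, intercept)
```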

Note that our formula for $\hat{\beta}$ involves the **correlation coefficient** $r$ of $x$ and $y$. The correlation coefficient of two variables is a measure of the strength of the linear relationship between them. $r$ ranges from -1 to 1, where $|r|=1$ is a perfect linear relationship and $r=0$ is no linear relationship. The formula for $r$ is

$$
r = \frac{1}{n}\sum^n_{i=1} \left ( \frac{x_i - \hat{\mu}_x}{\hat{\sigma}_x} \right ) \left ( \frac{y_i - \hat{\mu}_y}{\hat{\sigma}_y} \right )
$$

(Note: the form $\frac{x_i - \hat{\mu}_x}{\hat{\sigma}_x}$ of a variable $x$ is its representation in standard units, as mentioned above.)

To calculate the RMSE, we will write an `rmse` function that makes use of sklearn's `mean_squared_error` function.

```python
def rmse(target, pred):
    return np.sqrt(mean_squared_error(target, pred))
```
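
For instance (a small usage check, assuming the simulated `x`, `y`, `slope`, and `intercept` from above are still defined), we can compute the RMSE of our fitted line:

```python
# RMSE of the regression line fitted to the simulated data above.
fitted = slope * x + intercept
print(rmse(y, fitted))
```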

To get a better idea of what the RMSE represents, the figures below show a small dataset, a proposed regression line, and the squared errors that we are summing in the RMSE. The data points are

| $x$ | $y$ |
|-----|-----|
| 0   | 1   |
| 1   | .5  |
| 2   | -1  |
| 3   | 2   |
| 4   | -3  |

Here are the proposed regression lines and their errors:




```python
d = Table().with_columns(
    'x', make_array(0, 1, 2, 3, 4),
    'y', make_array(1, .5, -1, 2, -3))

def plot_line_and_errors(slope, intercept):
    print("RMSE:", rmse(slope * d.column('x') + intercept, d.column('y')))
    plt.figure(figsize=(5,5))
    points = make_array(-2, 7)
    p = plt.plot(points, slope*points + intercept, color='orange', label='Proposed line')
    ax = p[0].axes

    predicted_ys = slope*d.column('x') + intercept
    diffs = predicted_ys - d.column('y')
    for i in np.arange(d.num_rows):
        x = d.column('x').item(i)
        y = d.column('y').item(i)
        diff = diffs.item(i)

        if diff > 0:
            bottom_left_x = x
            bottom_left_y = y
        else:
            bottom_left_x = x + diff
            bottom_left_y = y + diff

        # Draw the squared error as a square and the error itself as a vertical segment.
        ax.add_patch(patches.Rectangle(make_array(bottom_left_x, bottom_left_y), abs(diff), abs(diff), color='red', alpha=.3, label=('Squared error' if i == 0 else None)))
        plt.plot(make_array(x, x), make_array(y, y + diff), color='red', alpha=.6, label=('Error' if i == 0 else None))

    plt.scatter(d.column('x'), d.column('y'), color='blue', label='Points')

    plt.xlim(-4, 8)
    plt.ylim(-6, 6)
    plt.gca().set_aspect('equal', adjustable='box')

    plt.legend(bbox_to_anchor=(1.8, .8))
    plt.show()

interact(plot_line_and_errors, slope=widgets.FloatSlider(min=-4, max=4, step=.1), intercept=widgets.FloatSlider(min=-4, max=4, step=.1));
```

## Econometric Single Variable Regression

The regression line can serve two purposes:

* Of particular interest to data scientists is the line's ability to predict values of $y$ for new values of $x$ that we haven't seen before.

* Of particular interest to economists is the line's ability to estimate, via its slope, the "true" underlying slope of the data.

This is why regression is such a powerful tool and forms the backbone of econometrics. If we believe that our data satisfy certain assumptions (which we won't explore too much this lecture), then we can use the slope of the regression line to estimate the "true" relation between the variables in question and learn more about the world we live in.

In econometrics, we usually write the "true" underlying linear relationship as follows:

$$
y = \alpha + \beta \cdot x + \varepsilon
$$

where $y$ and $x$ are values for any arbitrary point, $\alpha$ is the intercept, $\beta$ is the slope, and $\varepsilon$ is some noise. This is entirely analogous to the code from earlier that determined the true linear relationship in the data:

```python
y = 1.5 * x + noise
```

Here, $\beta = 1.5$, $\alpha = 0$, and $\varepsilon = \text{noise}$.

When we fit a regression line onto the data, we express the line as:

$$
\hat{y} = \hat{\alpha} + \hat{\beta} \cdot x
$$

Here, we put hats over the slope and intercept terms because they are *estimates* of the true slope and intercept terms. Similarly, we put a hat over $y$ because this is the $y$ value that the regression line predicts.

Notice how the noise term $\varepsilon$ does not appear in the expression for the regression line. This is because the noise term is a random variable that has no relation with $x$, and is thus impossible to predict from the data. Furthermore, the noise term has a mean value of 0, so on average we don't expect the noise to affect the underlying trends in the data.

For the Data 8 demonstration above, we forced these conditions to be true. With real data, however, these are assumptions that we have to make, and they are something that econometricians spend a lot of time thinking about.

### Years of Schooling and Earnings

Consider a case where we want to study how years of schooling relate to a person's earnings. This should be of particular interest to college students. Below we import a dataset that has the hourly wage, years of schooling, and other information on thousands of people sampled in the March 2012 Current Population Survey.

```python
cps = Table.read_table('cps.csv')
cps
```

We want to consider a person's wage and years of schooling. But first, we will convert wage to log wage. Wage is a variable that we would expect to grow proportionally (that is, exponentially) with changes in years of schooling, so we usually take the natural log of wage instead. Below we plot log wage against years of schooling for the CPS data.

```python
educ = cps.column('educ')
logwage = cps.column('logwage')

plt.figure(figsize=(8,6))
plt.scatter(educ, logwage)
plt.xlabel('Years of Education', fontsize = 14)
plt.ylabel('Log Wage', fontsize = 14)
plt.title('Log Wage vs. Years of Education', size = 20);
```

Now let's fit a least-squares regression line onto this data. First, we'll do it manually in the Data 8 style above.

```python
educ_standard = standard_units(educ)
logwage_standard = standard_units(logwage)

r = correlation(logwage_standard, educ_standard)
slope = r * np.std(logwage) / np.std(educ)
intercept = np.mean(logwage) - slope * np.mean(educ)
```

```python
plt.figure(figsize=(8,6))
plt.scatter(educ, logwage)
plt.plot(np.linspace(0, 20), slope * np.linspace(0, 20) + intercept, color='tab:orange')
plt.xlabel('Years of Education', fontsize = 14)
plt.ylabel('Log Wage', fontsize = 14)
plt.title('Log Wage vs. Years of Education', size = 20);
print('Slope: ', slope)
print('Intercept: ', intercept)
```

So from the very simple and straightforward model above, we estimate a slope of roughly 0.1, meaning we might expect a one-year increase in schooling to be associated with roughly a 10% increase in wage, on average.
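
That interpretation rests on the usual log approximation: for small changes, a 0.1 change in log wage corresponds to roughly a 10% change in wage. A quick check of the approximation:

```python
# exp(0.1) - 1 is about 0.105, so a 0.1 increase in log wage is roughly a 10% increase in wage.
print(np.exp(0.1) - 1)
```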

We can also see that we have a non-zero intercept term. We should be careful how we interpret this term; from a strictly mathematical point of view, the intercept represents the expected value of $y$ (in this case log wage) when $x = 0$. However, in economics it sometimes makes no sense for $x$ to be 0, and so we cannot use that interpretation. We won't go into detail this lecture, but regardless of whether the intercept is interpretable, we almost always want to include it.

## Uncertainty in $\hat{\beta}$

We mentioned earlier that the slope we estimate from regression is exactly that: an estimate of the "true" underlying slope. Because of this, the estimate $\hat{\beta}$ is a random variable that depends on the underlying data.

Let's assume there is the following true linear relation between log wage and years of schooling,

$$
\text{logwage} = \alpha + \beta \cdot \text{years of schooling} + \varepsilon
$$

and we try to estimate $\alpha$ and $\beta$.

If our data are "well-behaved", then even though there is uncertainty in our estimate $\hat{\beta}$, on average $\hat{\beta}$ will be $\beta$; that is to say that the expectation of $\hat{\beta}$ is $\beta$. Additionally, if our data are "well-behaved", then $\hat{\beta}$ has a normal distribution with mean $\beta$. We won't worry too much about what assumptions need to be satisfied to make the data "well-behaved".

You can think of each person as an observation of these variables, and using a sample of people we can estimate the relationship between the two variables. However, due to the noise term and the fact that we only have a finite sample of people, the true relationship is always hidden from us, and we can only hope to get better estimates by designing better experiments and sampling more people.

Let's try to get an idea of how "certain" we can be of our estimate $\hat{\beta}$. We'll do this in classic Data 8 style: bootstrapping. Using our existing sample data, we'll create new samples by bootstrapping from the existing data. Then, for each sample, we'll fit a line and keep the slope of that line in a list with all of the other slopes. Finally, we'll find the standard deviation of that list of slopes.

```python
slopes = make_array()
educ_logwage = cps.select("educ", "logwage")

np.random.seed(42)
for i in np.arange(200):
    educ_logwage_sample = educ_logwage.sample()
    y = educ_logwage_sample.column("logwage")
    X = educ_logwage_sample.column("educ")
    model = sm.OLS(y, sm.add_constant(X)).fit()
    slopes = np.append(model.params[1], slopes)

Table().with_columns("Slopes", slopes).hist()
print('Standard dev. of bootstrapped slopes: ', np.std(slopes))
```

Our bootstrapped approximation of the standard error, 0.00159, is pretty close to the true standard error of 0.00144. `statsmodels`, the package we will be using to perform regressions, uses an exact mathematical formula to find the standard error, whereas we approximated this value through simulation, but the idea behind the standard error is the same.
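
For reference, here is a minimal sketch of one such formula, the classical standard error of the slope under homoskedastic errors (an assumption on our part, using the `educ`, `logwage`, `slope`, and `intercept` defined above):

```python
# Classical (homoskedastic) standard error of the slope:
# SE(beta_hat) = sqrt( sigma_hat^2 / sum((x_i - x_bar)^2) ), with sigma_hat^2 = RSS / (n - 2).
resid = logwage - (intercept + slope * educ)
n = len(educ)
sigma2_hat = np.sum(resid ** 2) / (n - 2)
se_slope = np.sqrt(sigma2_hat / np.sum((educ - np.mean(educ)) ** 2))
print('Analytic standard error of the slope: ', se_slope)
```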

Armed with a standard error, we can now form a 95% confidence interval and perform a test of significance to see if $\hat{\beta}$ is significantly different from 0.

```python
# Using our resampled slopes
lower_bound = percentile(2.5, slopes)
upper_bound = percentile(97.5, slopes)
print('95% confidence interval: [{}, {}]'.format(lower_bound, upper_bound))
```

The 95% confidence interval does not contain 0, and so $\beta$ is unlikely to be 0.

## Regression with a Binary Variable

A binary variable is a variable that takes on the value of 1 if some condition is true, and 0 otherwise. These are also called dummy variables or indicator variables. It might sound strange at first, but you can actually perform regression of a variable like log earnings onto a binary variable.
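
For example (hypothetical data, just to illustrate the encoding), a dummy variable simply records whether a condition holds:

```python
# A dummy variable encodes a condition as 0/1 (hypothetical ages, for illustration only).
age = make_array(17, 21, 35, 15)
is_adult = (age >= 18).astype(int)   # array([0, 1, 1, 0])
is_adult
```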

Let's import a different dataset that has the following features. Some will be useful to us later.

```python
nlsy = Table.read_table('nlsy_cleaned_small.csv')
nlsy
```

Now let's visualize log earnings against the binary variable indicating whether or not a person in the sample went to college.

```python
coll = nlsy.column('college')
logearn = nlsy.column('log_earn_1999')

plt.figure(figsize=(8,6))
plt.scatter(coll, logearn)
plt.xlabel('College', fontsize = 14)
plt.ylabel('Log Earnings', fontsize = 14)
plt.title('Log Earnings vs. College Completion', size = 20)
plt.xticks([0,1]);
```

```python
no_college = nlsy.where('college', 0).column("log_earn_1999")
has_college = nlsy.where('college', 1).column("log_earn_1999")

plt.figure(figsize=(8,6))
plt.xlabel('College', fontsize = 14)
plt.ylabel('Log Earnings', fontsize = 14)
plt.title('Log Earnings vs. College Completion', size = 20)
plt.violinplot(no_college, positions = [0], points=20, widths=0.3, showmeans=True, showextrema=True, showmedians=False)
plt.violinplot(has_college, positions = [1], points=20, widths=0.3, showmeans=True, showextrema=True, showmedians=False);
```

Now let's fit a regression model:

```python
coll_standard = standard_units(coll)
logearn_standard = standard_units(logearn)

r = correlation(logearn_standard, coll_standard)
slope = r * np.std(logearn) / np.std(coll)
intercept = np.mean(logearn) - slope * np.mean(coll)

print("y = {:.5f} * x + {:.5f}".format(slope, intercept))
```

Wow! This regression would imply that, on average, we expect people who went to college to have roughly 70% higher earnings than those who did not go to college. Let's now plot this line on the data:

```python
plt.figure(figsize=(8,6))
plt.scatter(coll, logearn)
plt.plot(np.linspace(0, 1), slope * np.linspace(0, 1) + intercept, color='tab:orange')
plt.xlabel('College', fontsize = 14)
plt.ylabel('Log Earnings', fontsize = 14)
plt.title('Log Earnings vs. College Completion', size = 20)
plt.xticks([0,1]);
```

```python
no_college = nlsy.where('college', 0).column("log_earn_1999")
has_college = nlsy.where('college', 1).column("log_earn_1999")

plt.figure(figsize=(8,6))
plt.plot(np.linspace(0, 1), slope * np.linspace(0, 1) + intercept, color='tab:green')
plt.xlabel('College', fontsize = 14)
plt.ylabel('Log Earnings', fontsize = 14)
plt.title('Log Earnings vs. College Completion', size = 20)
plt.violinplot(no_college, positions = [0], points=20, widths=0.3, showmeans=True, showextrema=True, showmedians=False)
plt.violinplot(has_college, positions = [1], points=20, widths=0.3, showmeans=True, showextrema=True, showmedians=False);
```

When we perform a simple regression onto just a dummy variable, it is an important fact that $\hat{\alpha}$ is the mean value of $y$ for all observations in the sample where $x = 0$, and $\hat{\beta}$ is the difference between the mean value of $y$ for observations where $x = 1$ and observations where $x = 0$. Proving this claim is beyond our scope this week, but let's verify it with our data:

```python
avg_logearn_coll = np.mean(logearn[coll == 1])
avg_logearn_nocoll = np.mean(logearn[coll == 0])

print('Avg logearn for coll = 1: ', avg_logearn_coll)
print('Avg logearn for coll = 0: ', avg_logearn_nocoll)
print('Difference between the two: ', avg_logearn_coll - avg_logearn_nocoll)
```

```python
print('Intercept: ', intercept)
print('Slope: ', slope)
```
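
For intuition, here is a brief sketch of why this holds (not a full proof, and going slightly beyond what the lecture requires). Because $x_i$ is either 0 or 1, the sum of squared errors splits into the two groups, and $\hat{\alpha}$ and $\hat{\alpha} + \hat{\beta}$ can be chosen independently:

$$
\sum_{i} \left( y_i - \hat{\alpha} - \hat{\beta} x_i \right)^2
= \sum_{i:\, x_i = 0} \left( y_i - \hat{\alpha} \right)^2
+ \sum_{i:\, x_i = 1} \left( y_i - (\hat{\alpha} + \hat{\beta}) \right)^2
$$

Each sum is minimized by that group's mean, so $\hat{\alpha}$ is the average of $y$ when $x = 0$ and $\hat{\alpha} + \hat{\beta}$ is the average when $x = 1$, making $\hat{\beta}$ the difference in group means.
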
@@ -0,0 +1,58 @@
---
title: statsmodels
type: textbook
source_path: content/11-econometrics/statsmodels.ipynb
chapter: 11
---

# Using `statsmodels` for Regression

```python
from datascience import *
import numpy as np
import statsmodels.api as sm

cps = Table.read_table('cps.csv')
```

In the previous section, we used functions in NumPy and concepts taught in Data 8 to perform single variable regressions. It turns out that there are several Python packages that can perform these regressions for us and that extend nicely to the types of regressions we will cover in the next few sections. In this section, we introduce `statsmodels` for performing single variable regressions, a foundation upon which we will build our discussion of multivariable regression.

`statsmodels` is a popular Python package used to create and analyze various statistical models. To create a linear regression model in `statsmodels`, which is generally imported as `sm`, we use the following skeleton code:

```python
x = data.select(features).values       # Separate features (independent variables)
y = data.select(target).values         # Separate target (outcome variable)
model = sm.OLS(y, sm.add_constant(x))  # Initialize the OLS regression model
result = model.fit()                   # Fit the regression model and save it to a variable
result.summary()                       # Display a summary of results
```

*You must manually add a constant column of all 1's to your independent features. `statsmodels` will not do this for you, and if you fail to do so you will perform a regression without an intercept $\alpha$ term. This is done in the third line by calling `sm.add_constant` on `x`.* Also note that we call `.values` after we select the columns in `x` and `y`; this gives us NumPy arrays containing the corresponding values, since `statsmodels` can't process `Table`s.
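
As a tiny illustration of what `sm.add_constant` does (a sketch with made-up numbers), it simply prepends a column of ones:

```python
# add_constant prepends a column of ones, giving the intercept its own "feature".
sm.add_constant(np.array([5.0, 6.0, 7.0]))
# array([[1., 5.],
#        [1., 6.],
#        [1., 7.]])
```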

Recall the `cps` dataset we used in the previous section:

```python
cps
```

Let's use `statsmodels` to perform our regression of `logwage` on `educ` again.

```python
x = cps.select("educ").values
y = cps.select("logwage").values

model = sm.OLS(y, sm.add_constant(x))
results = model.fit()
results.summary()
```

The summary above provides us with a lot of information. Let's start with the most important pieces: the values of $\hat{\alpha}$ and $\hat{\beta}$. The middle table contains these as the `coef` values of `const` and `x1`: we have $\hat{\alpha} = 1.4723$ and $\hat{\beta} = 0.1078$.

Recall also our discussion of uncertainty in $\hat{\beta}$. `statsmodels` provides us with the calculated standard error in the `std err` column, and we see that the standard error of $\hat{\beta}$ is 0.001, matching our empirical estimate via bootstrapping from the last section. We can also see the 95% confidence interval that we calculated in the two rightmost columns.



Earlier we said that $\hat{\beta}$ has some normal distribution with mean $\beta$ if certain assumptions are satisfied. We can now see that the standard deviation of that normal distribution is the standard error of $\hat{\beta}$. We can also use this to test a null hypothesis that $\beta = 0$. To do so, we construct a [t-statistic](https://en.wikipedia.org/wiki/T-statistic) (which `statsmodels` does for you) that indicates how many standard deviations away $\hat{\beta}$ is from 0, assuming that the distribution of $\hat{\beta}$ is in fact centered at 0.

We can see that $\hat{\beta}$ is 74 standard deviations away from the null hypothesis mean of 0, which is an enormous number. How likely do you think it is to draw a random number roughly 74 standard deviations away from the mean, assuming a standard normal distribution? Essentially 0. This is strong evidence that the mean of the distribution of $\hat{\beta}$ (which is the true value $\beta$) is not 0. Accompanying the t-statistic is a p-value that indicates the level of statistical significance.
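
If you prefer to work with these quantities programmatically rather than reading them off the summary table, the fitted results object exposes them directly (standard `statsmodels` attributes):

```python
# Pull the estimates and their uncertainty out of the fitted results object.
print(results.params)      # alpha-hat (const) and beta-hat (x1)
print(results.bse)         # standard errors
print(results.tvalues)     # t-statistics against the null beta = 0
print(results.pvalues)     # corresponding p-values
print(results.conf_int())  # 95% confidence intervals
```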