my-markdown-library 0.1.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- checksums.yaml +7 -0
- data/F24LS_md/ Lecture 4 - Public.md +347 -0
- data/F24LS_md/Lecture 1 - Introduction and Overview.md +327 -0
- data/F24LS_md/Lecture 10 - Development_.md +631 -0
- data/F24LS_md/Lecture 11 - Econometrics.md +345 -0
- data/F24LS_md/Lecture 12 - Finance.md +692 -0
- data/F24LS_md/Lecture 13 - Environmental Economics.md +299 -0
- data/F24LS_md/Lecture 15 - Conclusion.md +272 -0
- data/F24LS_md/Lecture 2 - Demand.md +349 -0
- data/F24LS_md/Lecture 3 - Supply.md +329 -0
- data/F24LS_md/Lecture 5 - Production C-D.md +291 -0
- data/F24LS_md/Lecture 6 - Utility and Latex.md +440 -0
- data/F24LS_md/Lecture 7 - Inequality.md +607 -0
- data/F24LS_md/Lecture 8 - Macroeconomics.md +704 -0
- data/F24LS_md/Lecture 8 - Macro.md +700 -0
- data/F24LS_md/Lecture 9 - Game Theory_.md +436 -0
- data/F24LS_md/summary.yaml +105 -0
- data/F24Lec_MD/LecNB_summary.yaml +206 -0
- data/F24Lec_MD/lec01/lec01.md +267 -0
- data/F24Lec_MD/lec02/Avocados_demand.md +425 -0
- data/F24Lec_MD/lec02/Demand_Steps_24.md +126 -0
- data/F24Lec_MD/lec02/PriceElasticity.md +83 -0
- data/F24Lec_MD/lec02/ScannerData_Beer.md +171 -0
- data/F24Lec_MD/lec02/demand-curve-Fa24.md +213 -0
- data/F24Lec_MD/lec03/3.0-CubicCostCurve.md +239 -0
- data/F24Lec_MD/lec03/3.1-Supply.md +274 -0
- data/F24Lec_MD/lec03/3.2-sympy.md +332 -0
- data/F24Lec_MD/lec03/3.3a-california-energy.md +120 -0
- data/F24Lec_MD/lec03/3.3b-a-really-hot-tuesday.md +121 -0
- data/F24Lec_MD/lec04/lec04-CSfromSurvey-closed.md +335 -0
- data/F24Lec_MD/lec04/lec04-CSfromSurvey.md +331 -0
- data/F24Lec_MD/lec04/lec04-Supply-Demand-closed.md +519 -0
- data/F24Lec_MD/lec04/lec04-Supply-Demand.md +514 -0
- data/F24Lec_MD/lec04/lec04-four-plot-24.md +34 -0
- data/F24Lec_MD/lec04/lec04-four-plot.md +34 -0
- data/F24Lec_MD/lec05/Lec5-Cobb-Douglas.md +131 -0
- data/F24Lec_MD/lec05/Lec5-CobbD-AER1928.md +283 -0
- data/F24Lec_MD/lec06/6.1-Sympy-Differentiation.md +253 -0
- data/F24Lec_MD/lec06/6.2-3D-utility.md +287 -0
- data/F24Lec_MD/lec06/6.3-QuantEcon-Optimization.md +399 -0
- data/F24Lec_MD/lec06/6.4-latex.md +138 -0
- data/F24Lec_MD/lec06/6.5-Edgeworth.md +269 -0
- data/F24Lec_MD/lec07/7.1-inequality.md +283 -0
- data/F24Lec_MD/lec07/7.2-historical-inequality.md +237 -0
- data/F24Lec_MD/lec08/macro-fred-api.md +313 -0
- data/F24Lec_MD/lec09/lecNB-prisoners-dilemma.md +88 -0
- data/F24Lec_MD/lec10/Lec10.2-waterguard.md +401 -0
- data/F24Lec_MD/lec10/lec10.1-mapping.md +199 -0
- data/F24Lec_MD/lec11/11.1-slr.md +305 -0
- data/F24Lec_MD/lec11/11.2-mlr.md +171 -0
- data/F24Lec_MD/lec12/Lec12-4-PersonalFinance.md +590 -0
- data/F24Lec_MD/lec12/lec12-1_Interest_Payments.md +267 -0
- data/F24Lec_MD/lec12/lec12-2-stocks-options.md +235 -0
- data/F24Lec_MD/lec13/Co2_ClimateChange.md +139 -0
- data/F24Lec_MD/lec13/ConstructingMAC.md +213 -0
- data/F24Lec_MD/lec13/EmissionsTracker.md +170 -0
- data/F24Lec_MD/lec13/KuznetsHypothesis.md +219 -0
- data/F24Lec_MD/lec13/RoslingPlots.md +217 -0
- data/F24Lec_MD/lec15/vibecession.md +485 -0
- data/F24Textbook_MD/00-intro/index.md +292 -0
- data/F24Textbook_MD/01-demand/01-demand.md +152 -0
- data/F24Textbook_MD/01-demand/02-example.md +131 -0
- data/F24Textbook_MD/01-demand/03-log-log.md +284 -0
- data/F24Textbook_MD/01-demand/04-elasticity.md +248 -0
- data/F24Textbook_MD/01-demand/index.md +15 -0
- data/F24Textbook_MD/02-supply/01-supply.md +203 -0
- data/F24Textbook_MD/02-supply/02-eep147-example.md +86 -0
- data/F24Textbook_MD/02-supply/03-sympy.md +138 -0
- data/F24Textbook_MD/02-supply/04-market-equilibria.md +204 -0
- data/F24Textbook_MD/02-supply/index.md +16 -0
- data/F24Textbook_MD/03-public/govt-intervention.md +73 -0
- data/F24Textbook_MD/03-public/index.md +10 -0
- data/F24Textbook_MD/03-public/surplus.md +351 -0
- data/F24Textbook_MD/03-public/taxes-subsidies.md +282 -0
- data/F24Textbook_MD/04-production/index.md +15 -0
- data/F24Textbook_MD/04-production/production.md +178 -0
- data/F24Textbook_MD/04-production/shifts.md +296 -0
- data/F24Textbook_MD/05-utility/budget-constraints.md +166 -0
- data/F24Textbook_MD/05-utility/index.md +15 -0
- data/F24Textbook_MD/05-utility/utility.md +136 -0
- data/F24Textbook_MD/06-inequality/historical-inequality.md +253 -0
- data/F24Textbook_MD/06-inequality/index.md +15 -0
- data/F24Textbook_MD/06-inequality/inequality.md +226 -0
- data/F24Textbook_MD/07-game-theory/bertrand.md +257 -0
- data/F24Textbook_MD/07-game-theory/cournot.md +333 -0
- data/F24Textbook_MD/07-game-theory/equilibria-oligopolies.md +96 -0
- data/F24Textbook_MD/07-game-theory/expected-utility.md +61 -0
- data/F24Textbook_MD/07-game-theory/index.md +19 -0
- data/F24Textbook_MD/07-game-theory/python-classes.md +340 -0
- data/F24Textbook_MD/08-development/index.md +35 -0
- data/F24Textbook_MD/09-macro/CentralBanks.md +101 -0
- data/F24Textbook_MD/09-macro/Indicators.md +77 -0
- data/F24Textbook_MD/09-macro/fiscal_policy.md +36 -0
- data/F24Textbook_MD/09-macro/index.md +14 -0
- data/F24Textbook_MD/09-macro/is_curve.md +76 -0
- data/F24Textbook_MD/09-macro/phillips_curve.md +70 -0
- data/F24Textbook_MD/10-finance/index.md +10 -0
- data/F24Textbook_MD/10-finance/options.md +178 -0
- data/F24Textbook_MD/10-finance/value-interest.md +60 -0
- data/F24Textbook_MD/11-econometrics/index.md +16 -0
- data/F24Textbook_MD/11-econometrics/multivariable.md +218 -0
- data/F24Textbook_MD/11-econometrics/reading-econ-papers.md +25 -0
- data/F24Textbook_MD/11-econometrics/single-variable.md +483 -0
- data/F24Textbook_MD/11-econometrics/statsmodels.md +58 -0
- data/F24Textbook_MD/12-environmental/KuznetsHypothesis-Copy1.md +187 -0
- data/F24Textbook_MD/12-environmental/KuznetsHypothesis.md +187 -0
- data/F24Textbook_MD/12-environmental/MAC.md +254 -0
- data/F24Textbook_MD/12-environmental/index.md +36 -0
- data/F24Textbook_MD/LICENSE.md +11 -0
- data/F24Textbook_MD/intro.md +26 -0
- data/F24Textbook_MD/references.md +25 -0
- data/F24Textbook_MD/summary.yaml +414 -0
- metadata +155 -0
@@ -0,0 +1,305 @@ data/F24Lec_MD/lec11/11.1-slr.md
---
title: "11.1-slr"
type: lecture-notebook
week: 11
source_path: "/Users/ericvandusen/Documents/Data88E-ForTraining/F24Lec_NBs/lec11/11.1-slr.ipynb"
---

<table style="width: 100%;" id="nb-header">
<tr style="background-color: transparent;"><td>
<img src="https://data-88e.github.io/assets/images/blue_text.png" width="250px" style="margin-left: 0;" />
</td><td>
<p style="text-align: right; font-size: 10pt;"><strong>Economic Models</strong>, Fall 24<br>
Dr. Eric Van Dusen <br>
Vaidehi Bulusu <br>
<br>
</table>

# Lecture Notebook 11.1: Simple Linear Regression

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib import patches
import statsmodels.api as sm
import warnings
warnings.simplefilter("ignore")
from datascience import *
%matplotlib inline
from sklearn.metrics import mean_squared_error
import ipywidgets as widgets
from ipywidgets import interact
```

# Simple Linear Regression

Let's learn about simple linear regression using a dataset you're already familiar with from week 2: the dataset on the price and quantity demanded of avocados. Recall that in week 2, we used `np.polyfit()` to find the slope and intercept. Here, we'll use regression to calculate the slope and intercept from scratch and get the same results!

Go over this lecture notebook along with the slides!

```python
av_df = Table.read_table("avocados.csv")
av_df.show(5)
```

## Scatter Plot

To investigate the relationship between price and quantity demanded, we can start by creating a scatter plot:

```python
av_df.scatter("Average Price", "Total Volume")
plt.ylabel("Quantity Demanded")
plt.xlabel("Price")
plt.title("Demand Curve")
plt.show()
```

By looking at the scatter plot, we get a sense that there's a somewhat strong negative relationship between price and quantity demanded. Notice that the points don't all lie on a straight line, as it's not a perfect association. This is because there is some random variation in the data, and there are factors other than price that affect the quantity demanded (we'll talk more about this later).

## Correlation Coefficient

To quantify the association and give it a numerical value, we can calculate the correlation coefficient between price and quantity demanded. Recall from DATA 8 that the correlation coefficient only measures **linear** association, so for the purpose of this class we are assuming that the relationship between the price and quantity demanded of avocados is linear (in reality, it may be nonlinear).

Let's start by defining a function to calculate the correlation coefficient.

```python
# Assumes that array_x and array_y are already in standard units
def correlation(array_x, array_y):
    return np.mean(array_x * array_y)
```

The first step in calculating the correlation coefficient is converting the data to standard units. When you express a value in standard units, you are standardizing the variable by expressing it in terms of the number of standard deviations it is above or below the mean (you can think of this as the "distance" from the mean). For example, the first value of the price shown below is around 1.2 standard deviations below the mean price.

Expressing the data in standard units removes the measured units (e.g. dollars) from the data. So, the correlation coefficient is not affected by a change in the units/scale (e.g. if you convert the prices from USD to CAD).

```python
# Expressing price in standard units
price_su = (av_df["Average Price"] - np.mean(av_df["Average Price"]))/np.std(av_df["Average Price"])
price_su[:5]
```

```python
# Expressing quantity in standard units
quantity_su = (av_df["Total Volume"] - np.mean(av_df["Total Volume"]))/np.std(av_df["Total Volume"])
quantity_su[:5]
```

Now we can use the function we defined to calculate the correlation coefficient:

```python
r = correlation(price_su, quantity_su)
r
```

As we expected, there is a pretty strong negative correlation between price and demand! It aligns with what we saw in the scatter plot.
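As a quick check of the claim above that correlation is unaffected by a change of units, here is a minimal sketch; the 1.35 USD-to-CAD exchange rate is just an illustrative assumption:

```python
# Hypothetical rescaling of price (USD -> CAD at an assumed rate of 1.35).
# Standard units remove the scale, so the correlation should be unchanged.
price_cad = av_df["Average Price"] * 1.35
price_cad_su = (price_cad - np.mean(price_cad)) / np.std(price_cad)
correlation(price_cad_su, quantity_su)  # should equal r (up to floating-point error)
```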

## Regression Equation

Let's use the correlation coefficient to build the regression equation. The first step is to calculate the slope and intercept:

```python
slope = r * np.std(av_df["Total Volume"])/np.std(av_df["Average Price"])
slope
```

```python
intercept = np.mean(av_df["Total Volume"]) - (slope * np.mean(av_df["Average Price"]))
intercept
```

We can now write the regression equation for the relationship between price and quantity demanded.

$$\widehat{quantity} = 1446951.641 - 476412.719 \times price$$

This equation allows you to do 2 things:

- Quantify the association between price and quantity demanded: as price increases by $1, the quantity demanded decreases by around 476,000 units (avocados)

- Predict values of quantity demanded: this is especially useful for values of price that are not already in the dataset (see the short sketch right after this list)
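For example, here is a minimal sketch of predicting demand at a hypothetical price of $2.50 using the slope and intercept computed above:

```python
# Hypothetical price point (not necessarily in the dataset), chosen purely for illustration
hypothetical_price = 2.50
intercept + slope * hypothetical_price
```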

Note that the regression line is the line of best fit for your data (more on this later). We can plot this line of best fit on the scatter plot from before:

```python
av_df.scatter("Average Price", "Total Volume")
plt.plot(av_df["Average Price"], slope * av_df["Average Price"] + intercept, color='tab:orange')
plt.show()
```

Does this graph look familiar? You've already done a similar analysis in week 2 when you used `np.polyfit` to find the slope and intercept of the line of best fit!

```python
params = np.polyfit(av_df["Average Price"], av_df["Total Volume"], 1)
params
```

The slope and intercept you get from `np.polyfit` should be very close to the slope and intercept from before (there might be some tiny variation). This is because, under the hood, `np.polyfit` is using simple linear regression to calculate the parameters.

```python
slope_param = params[0]
np.round(slope_param, 5) == np.round(slope, 5)
```

```python
intercept_param = params[1]
np.round(intercept_param, 5) == np.round(intercept, 5)
```

## RMSE

Let's use the slope and intercept to generate predictions for the quantity demanded:

```python
predicted_quantity = intercept + (slope * av_df["Average Price"])
predicted_quantity[:5]
```

We want our predictions to be as accurate as possible. One metric we can use to measure the accuracy of our predictions is the root mean squared error (RMSE), which gives us a sense of how far our predictions are from the actual values, on average.

Let's define a function for calculating the RMSE of our data (note that this function is specific to the avocados dataset).

```python
def rmse_p_q(slope, intercept):
    predicted_y = intercept + (slope * av_df["Average Price"])
    return (np.mean((av_df["Total Volume"] - predicted_y)**2))**0.5
```

```python
rmse_1 = rmse_p_q(slope, intercept)
rmse_1
```

On average, our predictions are off by about 132,000 units (avocados). This is a pretty high RMSE! This shows that even though price and quantity demanded have a pretty strong correlation (about 0.72 in magnitude), it doesn't necessarily mean that price is a good predictor of the quantity demanded.

We said earlier that the regression line is the line of best fit. The reason why is that the regression line minimizes the RMSE of your data. In other words, among all the lines you could use to model your data, the regression line is the most accurate one because it minimizes the RMSE – this is why this technique is called (ordinary) least squares regression.

We can test this using the `minimize` function from the `datascience` library. Recall that `minimize` takes a function as input and gives you the values of that function's arguments that minimize it. Below, we would expect the output of `minimize` to be similar to the slope and intercept we calculated:

```python
minimum = minimize(rmse_p_q)
minimum
```

The slope and intercept from the `minimize` function are very similar to the slope and intercept we calculated, as expected!

When you perform regression in Python (e.g. when you use `np.polyfit`), it tries to find the values of the slope and intercept that minimize the RMSE. You can build your intuition for how it does this by playing around with this slider from [last semester's notebook](https://datahub.berkeley.edu/hub/user-redirect/git-pull?repo=https%3A%2F%2Fgithub.com%2Fdata-88e%2Fecon-sp21&urlpath=tree%2Fecon-sp21%2Flec%2Flec12%2Flec12.ipynb&branch=main) (you can also find the link on last semester's [website](https://data-88e.github.io/sp21/), in econometrics week). See if you can find values for the slope and intercept that minimize the squared error area.

```python
d = Table().with_columns(
    'x', make_array(0, 1, 2, 3, 4),
    'y', make_array(1, .5, -1, 2, -3))

def rmse(target, pred):
    return np.sqrt(mean_squared_error(target, pred))

def plot_line_and_errors(slope, intercept):
    print("RMSE:", rmse(slope * d.column('x') + intercept, d.column('y')))
    plt.figure(figsize=(5,5))
    points = make_array(-2, 7)
    p = plt.plot(points, slope*points + intercept, color='orange', label='Proposed line')
    ax = p[0].axes

    predicted_ys = slope*d.column('x') + intercept
    diffs = predicted_ys - d.column('y')
    for i in np.arange(d.num_rows):
        x = d.column('x').item(i)
        y = d.column('y').item(i)
        diff = diffs.item(i)

        if diff > 0:
            bottom_left_x = x
            bottom_left_y = y
        else:
            bottom_left_x = x + diff
            bottom_left_y = y + diff

        ax.add_patch(patches.Rectangle(make_array(bottom_left_x, bottom_left_y), abs(diff), abs(diff), color='red', alpha=.3, label=('Squared error' if i == 0 else None)))
        plt.plot(make_array(x, x), make_array(y, y + diff), color='red', alpha=.6, label=('Error' if i == 0 else None))

    plt.scatter(d.column('x'), d.column('y'), color='blue', label='Points')

    plt.xlim(-4, 8)
    plt.ylim(-6, 6)
    plt.gca().set_aspect('equal', adjustable='box')

    plt.legend(bbox_to_anchor=(1.8, .8))
    plt.show()

interact(plot_line_and_errors, slope=widgets.FloatSlider(min=-4, max=4, step=.1), intercept=widgets.FloatSlider(min=-4, max=4, step=.1));
```

## Regression in Python

So far, we've looked at the DATA 8 version of regression, in which we calculated the correlation coefficient and used it to find the slope and intercept of our regression line.

Now we can look at how regression is actually performed in Python using a library called `statsmodels`.

```python
# import statsmodels.api as sm – see the first cell in the notebook
x_1 = av_df.column("Average Price") # Selecting the independent variable(s)
y_1 = av_df.column("Total Volume") # Selecting the dependent variable
model_1 = sm.OLS(y_1, sm.add_constant(x_1)) # Performing OLS regression – make sure to include an intercept using sm.add_constant
result_1 = model_1.fit() # Fitting the model
result_1.summary() # Showing a summary of the results of the regression
```

Let's look at the parameters (slope and intercept) of the regression and see if they are close to the parameters we calculated before:

```python
type(result_1.params)
```

```python
np.round(result_1.params[0], 5) == np.round(intercept, 5)
```

```python
np.round(result_1.params[1], 5) == np.round(slope, 5)
```

They are similar, as we expected!

## Hypothesis Testing

Since we are interested in whether or not there is a substantial association between x and y (in this case, price and quantity demanded), we can frame this as a hypothesis testing question:

- The null hypothesis would be that there isn't a substantial association between x and y ($\beta = 0$)

- The alternative hypothesis would be that there is a substantial association between x and y ($\beta \neq 0$)
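The `statsmodels` output from before already reports this test for the slope. As a minimal sketch (assuming `result_1` from the earlier fit is still in scope), you can pull out the t-statistic and p-value directly:

```python
# Index 1 corresponds to the slope on "Average Price" (index 0 is the constant)
print("t-statistic for the slope:", result_1.tvalues[1])
print("p-value for testing beta = 0:", result_1.pvalues[1])
```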

Go back to the regression output from before: notice that it shows a 95% confidence interval for the slope. Let's try to visualize this confidence interval by creating a distribution of the regression slopes:

```python
slopes_array = make_array()
for i in np.arange(5000):
    bootstrapped_sample = av_df.select("Average Price", "Total Volume").sample()
    x = bootstrapped_sample.column("Average Price")
    y = bootstrapped_sample.column("Total Volume")
    results = sm.OLS(y, sm.add_constant(x)).fit()
    slopes_array = np.append(results.params[1], slopes_array)
Table().with_column("Slopes", slopes_array).hist()
print("Center of the distribution (mean):", np.mean(slopes_array))
print("Standard deviation:", np.std(slopes_array))
```

```python
ci_lower_bound = percentile(2.5, slopes_array)
ci_lower_bound
```

```python
ci_upper_bound = percentile(97.5, slopes_array)
ci_upper_bound
```

Notice that the center of the distribution (the average of the bootstrapped slopes) is similar to the slope from the regression output. The lower and upper bounds of the confidence interval are also similar.

Recall that if the confidence interval contains 0, the data are consistent with the null hypothesis at the 5% significance level (we fail to reject it). If it doesn't contain 0, there is evidence against the null hypothesis at the 5% level.

The confidence interval in the regression output above doesn't contain 0, so there's evidence that price and quantity demanded have a significant association. This makes sense, as they have a relatively high correlation coefficient.
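As a quick programmatic check (again assuming `result_1` is in scope), `conf_int()` returns the interval shown in the summary, and we can compare it with the bootstrap percentiles above:

```python
# 95% confidence interval for the slope from statsmodels (row 1 = slope, row 0 = constant)
lower, upper = result_1.conf_int()[1]
print("statsmodels CI for the slope:", lower, "to", upper)
print("contains 0?", lower <= 0 <= upper)
print("bootstrap CI for the slope:", ci_lower_bound, "to", ci_upper_bound)
```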

```python

```
@@ -0,0 +1,171 @@ data/F24Lec_MD/lec11/11.2-mlr.md
---
title: "11.2-mlr"
type: lecture-notebook
week: 11
source_path: "/Users/ericvandusen/Documents/Data88E-ForTraining/F24Lec_NBs/lec11/11.2-mlr.ipynb"
---

<table style="width: 100%;" id="nb-header">
<tr style="background-color: transparent;"><td>
<img src="https://data-88e.github.io/assets/images/blue_text.png" width="250px" style="margin-left: 0;" />
</td><td>
<p style="text-align: right; font-size: 10pt;"><strong>Economic Models</strong>, Fall 24 <br>
Dr. Eric Van Dusen <br>
Vaidehi Bulusu <br>
<br>
</table>

# Lecture Notebook 11.2: Multiple Linear Regression

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib import patches
import statsmodels.api as sm
import warnings
warnings.simplefilter("ignore")
from datascience import *
%matplotlib inline
from sklearn.metrics import mean_squared_error
import ipywidgets as widgets
from ipywidgets import interact
```

# Multiple Linear Regression

In simple linear regression, we only consider one independent variable. We're essentially assuming that this is the only variable that affects our dependent variable. Going back to the price and quantity demanded example, while price does have a huge effect on demand, there are other factors that also affect demand, including income taxes, sales taxes, advertising, prices of related goods, etc. In the context of regression, because we are not including these variables in our model, they are called **omitted variables**.

Omitted variables cause 2 main problems:

- The regression parameters end up being inaccurate (biased) because of something called omitted variable bias. Your regression estimates for the slope and intercept are higher or lower than the actual values because of omitted variables (we are generally only concerned about the slope).

- They prevent you from inferring a causal association between the independent and dependent variables. In other words, you can't say that it's *because* of an increase in price that your quantity demanded decreased, as there are so many other factors – like the change in the price of a related good – that could be causing the decrease in demand.

To try to eliminate omitted variable bias from our model, we take simple linear regression a step further: to multiple linear regression. In multiple linear regression, we include more independent variables – variables we think are confounding variables that we've omitted – to reduce omitted variable bias.

Let's look at multiple linear regression in Python using a new dataset on earnings and various other factors from the [Stock and Watson datasets](https://www.princeton.edu/~mwatson/Stock-Watson_3u/Students/Stock-Watson-EmpiricalExercises-DataSets.htm) (check out the [data description](https://www.princeton.edu/~mwatson/Stock-Watson_3u/Students/EE_Datasets/CPS12_Description.pdf)).

```python
cps = Table.read_table("CPS.csv")
cps.show(5)
```

Say we want to look at the relationship between age and earnings (the `ahe` column, which is average hourly earnings). We would expect whether or not a person has a bachelor's degree to be a confounding variable, as those with a bachelor's degree typically earn more than those with only a high school degree. This is how we would do multiple linear regression:

```python
x_2 = cps.select("bachelor", "age").values # This is how we include multiple independent variables in our model
y_2 = cps.column("ahe")
model_2 = sm.OLS(y_2, sm.add_constant(x_2))
result_2 = model_2.fit()
result_2.summary()
```

## Dummy Variables

One type of variable commonly used in econometrics is the dummy variable. These are variables that take a value of either 0 or 1 to indicate the presence or absence of a category. For example, take a variable like `col`: it takes the value of 1 to indicate that a person went to college and 0 to indicate that the person didn't go to college.

Let's do a regression of `ahe` on only `bachelor` (the dummy variable).

```python
x_3 = cps.select("bachelor").values
y_3 = cps.column("ahe")
model_3 = sm.OLS(y_3, sm.add_constant(x_3))
result_3 = model_3.fit()
result_3.summary()
```

The coefficient (or slope) on `bachelor` is this:

```python
result_3.params[1]
```

This means that the people in our sample with a bachelor's degree earn around $6.583 more per hour than those with only a high school degree.

An interesting fact about dummy variables is that we can calculate this coefficient another way:

```python
# Filter for bachelor = 1 and find mean earnings
b_1_mean = np.mean(cps.where("bachelor", 1).column("ahe"))

# Filter for bachelor = 0 and find mean earnings
b_0_mean = np.mean(cps.where("bachelor", 0).column("ahe"))

# Take the difference in the mean earnings
diff = b_1_mean - b_0_mean
diff
```

```python
np.round(result_3.params[1], 5) == np.round(diff, 5)
```

These two values are pretty much the same! This is because the coefficient on a dummy x-variable is just equal to the difference between the mean of the y-variable when x = 1 and when x = 0.

### `pd.get_dummies()`

We can convert categorical variables in our dataset to dummy variables using the `pd.get_dummies()` function. This function gives you a table showing the presence or absence of each category for every observation in the dataset. Here's an example of converting the year values to dummies: 1 indicates that the person was surveyed in that year (e.g. the first 4 people in the dataset were surveyed in 1992, according to the table below).

```python
year = cps.column("year")
np.unique(year)
```

```python
year_dummies = pd.get_dummies(year)
year_dummies.head()
```

Let's do the same for the age variable. Take the first row as an example: since the 1 is under the 29 column, that person is 29 years old.

```python
age = cps.column("age")
np.unique(age)
```

```python
age_dummies = pd.get_dummies(age)
age_dummies.head()
```

### Dummy Variable Trap

One problem you may run into when using dummy variables is called the dummy variable trap. This happens when you include variables for all the values of the dummy variable: for example, you include `col`, which takes on a value of 1 if the person went to college, as well as `notcol`, which takes on a value of 1 if the person didn't go to college.

This causes redundancy, as you can express one independent variable as a linear combination of another independent variable. In this case:

$$notcol = 1 - col$$

This means that there is a perfect correlation between these two independent variables, which results in perfect multicollinearity. In general, multicollinearity occurs any time there is a high correlation between the independent variables in your model. It causes your regression estimates to be highly inaccurate.
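A common diagnostic for this is the variance inflation factor (VIF). Here is a minimal sketch using `statsmodels` (assuming `cps` is loaded as above); a very large VIF flags a regressor that is close to a linear combination of the others:

```python
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Design matrix with a constant plus the two regressors used earlier
X = sm.add_constant(cps.select("bachelor", "age").values)
for i in range(1, X.shape[1]):  # skip the constant column
    print("VIF for column", i, ":", variance_inflation_factor(X, i))
```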

Let's see what happens in Python when there is multicollinearity:

```python
year_dummies["ahe"] = np.array(cps.column("ahe"))
year_dummies.head()
```

```python
x_3 = year_dummies[[1992, 2008]] # Including dummy variables for every value of year, plus a constant
y_3 = year_dummies["ahe"]
model_3 = sm.OLS(y_3, sm.add_constant(x_3))
result_3 = model_3.fit()
result_3.summary()
```

Python detected the multicollinearity and gave you a warning. A solution is to drop one of the redundant columns: either one of the dummy variables or, as in the cell below, the constant term.

```python
x_3 = year_dummies[[1992, 2008]] # Keeping both year dummies, but dropping the constant term this time
y_3 = year_dummies["ahe"]
model_3 = sm.OLS(y_3, x_3)
result_3 = model_3.fit()
result_3.summary()
```
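Alternatively, `pd.get_dummies` can drop one category for you. Here is a minimal sketch using the same `year` array as above; with the 1992 column omitted, the remaining dummy plus a constant is no longer redundant:

```python
# drop_first=True omits the first category (1992); cast to float for the OLS design matrix
year_dummies_dropped = pd.get_dummies(year, drop_first=True).astype(float)
y_4 = cps.column("ahe")
sm.OLS(y_4, sm.add_constant(year_dummies_dropped)).fit().summary()
```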

```python

```