{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Multiple Regression"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's grab a small data set of Blue Book car values:"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"import matplotlib.pyplot as plt\n",
"\n",
"df = pd.read_excel('http://cdn.sundog-soft.com/Udemy/DataScience/cars.xls')\n"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" Mileage Price\n",
"Mileage \n",
"(0, 10000] 5588.629630 24096.714451\n",
"(10000, 20000] 15898.496183 21955.979607\n",
"(20000, 30000] 24114.407104 20278.606252\n",
"(30000, 40000] 33610.338710 19463.670267\n"
]
},
{
"data": {
"image/png": "",
"text/plain": [
"<Figure size 640x480 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
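],
"source": [
"%matplotlib inline\n",
"import numpy as np\n",
"df1 = df[['Mileage', 'Price']]\n",
"bins = np.arange(0, 50000, 10000)\n",
"# Average mileage and price within each 10,000-mile bin; observed=False states the current groupby behavior explicitly\n",
"groups = df1.groupby(pd.cut(df1['Mileage'], bins), observed=False).mean()\n",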
"print(groups.head())\n",
"groups['Price'].plot.line()\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can use pandas to split this table into the feature columns we're interested in and the value we're trying to predict.\n",
"\n",
"Note how we are avoiding the make and model; regressions don't work well with categorical values, unless you can convert them into some numerical order that makes sense.\n",
"\n",
"Let's standardize our feature data (zero mean, unit variance) so the features share a common scale and we can easily compare the coefficients we end up with."
]
},
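{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a rough sketch of what standardizing does, the cell below applies the usual z-score formula, (x - mean) / standard deviation, by hand with NumPy. The mileage numbers are made up for illustration and are not taken from the cars data set; the formula is assumed to match what `StandardScaler` applies to each column with its default settings."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"# Hypothetical mileage values, invented for this illustration only\n",
"mileage = np.array([8000.0, 15000.0, 22000.0, 31000.0])\n",
"\n",
"# z-score: subtract the mean, then divide by the population standard deviation (ddof=0)\n",
"z = (mileage - mileage.mean()) / mileage.std()\n",
"\n",
"print(z)\n",
"print(z.mean(), z.std())  # approximately 0 and 1"
]
},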
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" const Mileage Cylinder Doors\n",
"0 1.0 -1.417485 0.52741 0.556279\n",
"1 1.0 -1.305902 0.52741 0.556279\n",
"2 1.0 -0.810128 0.52741 0.556279\n",
"3 1.0 -0.426058 0.52741 0.556279\n",
"4 1.0 0.000008 0.52741 0.556279\n",
".. ... ... ... ...\n",
"799 1.0 -0.439853 0.52741 0.556279\n",
"800 1.0 -0.089966 0.52741 0.556279\n",
"801 1.0 0.079605 0.52741 0.556279\n",
"802 1.0 0.750446 0.52741 0.556279\n",
"803 1.0 1.932565 0.52741 0.556279\n",
"\n",
"[804 rows x 4 columns]\n",
" OLS Regression Results \n",
"==============================================================================\n",
"Dep. Variable: Price R-squared: 0.360\n",
"Model: OLS Adj. R-squared: 0.358\n",
"Method: Least Squares F-statistic: 150.0\n",
"Date: Mon, 03 Mar 2025 Prob (F-statistic): 3.95e-77\n",
"Time: 10:37:38 Log-Likelihood: -8356.7\n",
"No. Observations: 804 AIC: 1.672e+04\n",
"Df Residuals: 800 BIC: 1.674e+04\n",
"Df Model: 3 \n",
"Covariance Type: nonrobust \n",
"==============================================================================\n",
" coef std err t P>|t| [0.025 0.975]\n",
"------------------------------------------------------------------------------\n",
"const 2.134e+04 279.405 76.388 0.000 2.08e+04 2.19e+04\n",
"Mileage -1272.3412 279.567 -4.551 0.000 -1821.112 -723.571\n",
"Cylinder 5587.4472 279.527 19.989 0.000 5038.754 6136.140\n",
"Doors -1404.5513 279.446 -5.026 0.000 -1953.085 -856.018\n",
"==============================================================================\n",
"Omnibus: 157.913 Durbin-Watson: 0.069\n",
"Prob(Omnibus): 0.000 Jarque-Bera (JB): 257.529\n",
"Skew: 1.278 Prob(JB): 1.20e-56\n",
"Kurtosis: 4.074 Cond. No. 1.03\n",
"==============================================================================\n",
"\n",
"Notes:\n",
"[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.\n"
]
}
],
"source": [
"import statsmodels.api as sm\n",
"from sklearn.preprocessing import StandardScaler\n",
"scale = StandardScaler()\n",
"\n",
"# Take an explicit copy so the scaled values are written to X rather than to a slice of df\n",
"X = df[['Mileage', 'Cylinder', 'Doors']].copy()\n",
"y = df['Price']\n",
"\n",
"X[['Mileage', 'Cylinder', 'Doors']] = scale.fit_transform(X[['Mileage', 'Cylinder', 'Doors']].values)\n",
"\n",
"# Add a constant column to our model so we can have a Y-intercept\n",
"X = sm.add_constant(X)\n",
"\n",
"print(X)\n",
"\n",
"est = sm.OLS(y, X).fit()\n",
"\n",
"print(est.summary())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The table of coefficients above gives us the values to plug into an equation of the form:\n",
"\n",
"    Price = B0 + B1 * Mileage + B2 * Cylinder + B3 * Doors\n",
"\n",
"Because the features were standardized, the coefficients are directly comparable. Cylinder's coefficient (about 5587) has roughly four times the magnitude of Mileage's (about -1272) or Doors' (about -1405), so the number of cylinders is clearly the most important of the three features.\n",
"\n",
"Could we have figured that out earlier?"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Doors\n",
"2 23807.135520\n",
"4 20580.670749\n",
"Name: Price, dtype: float64"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"y.groupby(df.Doors).mean()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Surprisingly, more doors does not mean a higher price! (Maybe fewer doors implies a sports car in some cases?) So it's not surprising that the number of doors is a much weaker predictor than the number of cylinders here. This is a very small data set, however, so we can't really read much meaning into it.\n",
"\n",
"How would you use this to make an actual prediction? Start by scaling your feature values with the same scaler used to train the model, add the constant column back in, then just call est.predict() on the result:"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[1. 3.07256589 1.96971667 0.55627894]\n",
"[27658.15707316]\n"
]
}
],
"source": [
"# Scale a made-up car (45,000 miles, 8 cylinders, 4 doors) with the scaler fitted above\n",
"scaled = scale.transform([[45000, 8, 4]])\n",
"scaled = np.insert(scaled[0], 0, 1)  # Need to add that constant column in again.\n",
"print(scaled)\n",
"predicted = est.predict(scaled)\n",
"print(predicted)"
]
},
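{
"cell_type": "markdown",
"metadata": {},
"source": [
"To see what `est.predict()` is doing, here is a hand-worked sketch of the same regression equation. The slope coefficients are copied from the summary table above and the scaled inputs from the printed output of the previous cell. The intercept is only shown rounded there (2.134e+04), so this is assumed to land within a few dollars of the model's prediction rather than match it exactly."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"# Coefficients as printed in the OLS summary: const, Mileage, Cylinder, Doors.\n",
"# The intercept is only available rounded to four significant figures.\n",
"coefs = np.array([21340.0, -1272.3412, 5587.4472, -1404.5513])\n",
"\n",
"# Scaled inputs as printed above for 45,000 miles, 8 cylinders, 4 doors,\n",
"# with the leading 1 standing in for the constant column.\n",
"features = np.array([1.0, 3.07256589, 1.96971667, 0.55627894])\n",
"\n",
"# B0 + B1 * Mileage + B2 * Cylinder + B3 * Doors is just a dot product\n",
"manual_prediction = coefs @ features\n",
"print(manual_prediction)"
]
},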
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Activity"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Mess around with the made-up input values passed to the prediction above, and see if you can create a measurable influence of the number of doors on price. Have some fun with it - why stop at 4 doors?"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python [conda env:base] *",
"language": "python",
"name": "conda-base-py"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.3"
}
},
"nbformat": 4,
"nbformat_minor": 4
}