Linear regression on the Boston Housing dataset using R

Posted on Mon 06 November 2017 in Notebook

In [1]:
# Boston housing data (originally from the MASS package), served as CSV by Rdatasets
url <- "https://raw.githubusercontent.com/vincentarelbundock/Rdatasets/master/csv/MASS/Boston.csv"
path <- "/tmp/Boston.csv"
download.file(url, path)
In [2]:
df <- read.csv(path, header = TRUE)
In [3]:
head(df)
X    crim zn indus chas   nox    rm  age    dis rad tax ptratio  black lstat medv
1 0.00632 18  2.31    0 0.538 6.575 65.2 4.0900   1 296    15.3 396.90  4.98 24.0
2 0.02731  0  7.07    0 0.469 6.421 78.9 4.9671   2 242    17.8 396.90  9.14 21.6
3 0.02729  0  7.07    0 0.469 7.185 61.1 4.9671   2 242    17.8 392.83  4.03 34.7
4 0.03237  0  2.18    0 0.458 6.998 45.8 6.0622   3 222    18.7 394.63  2.94 33.4
5 0.06905  0  2.18    0 0.458 7.147 54.2 6.0622   3 222    18.7 396.90  5.33 36.2
6 0.02985  0  2.18    0 0.458 6.430 58.7 6.0622   3 222    18.7 394.12  5.21 28.7
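
The leading X column is not a housing variable. It looks like the row index that was written out when the data frame was saved as CSV, which read.csv then labels X. A minimal sketch of dropping it at load time follows, assuming the MASS package (which ships the same Boston data) is installed. The temporary file and the names tmp, raw and clean are illustrative and not part of the original notebook.

```r
library(MASS)  # assumption: MASS is installed; it provides the Boston data frame

# Recreate the CSV layout locally: write.csv stores row names in an unnamed
# first column, which read.csv later labels "X".
tmp <- tempfile(fileext = ".csv")
write.csv(Boston, tmp)

raw   <- read.csv(tmp)                 # 15 columns, including the index "X"
clean <- read.csv(tmp, row.names = 1)  # first column becomes row names instead

names(raw)[1]  # "X"
dim(clean)     # 506 rows, 14 columns
```

With the index removed, summary() and cor() no longer report statistics for a column that only counts rows.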

Summary statistics

In [4]:
summary(df)
       X              crim                zn             indus      
 Min.   :  1.0   Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46  
 1st Qu.:127.2   1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu.: 5.19  
 Median :253.5   Median : 0.25651   Median :  0.00   Median : 9.69  
 Mean   :253.5   Mean   : 3.61352   Mean   : 11.36   Mean   :11.14  
 3rd Qu.:379.8   3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10  
 Max.   :506.0   Max.   :88.97620   Max.   :100.00   Max.   :27.74  
      chas              nox               rm             age        
 Min.   :0.00000   Min.   :0.3850   Min.   :3.561   Min.   :  2.90  
 1st Qu.:0.00000   1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02  
 Median :0.00000   Median :0.5380   Median :6.208   Median : 77.50  
 Mean   :0.06917   Mean   :0.5547   Mean   :6.285   Mean   : 68.57  
 3rd Qu.:0.00000   3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08  
 Max.   :1.00000   Max.   :0.8710   Max.   :8.780   Max.   :100.00  
      dis              rad              tax           ptratio     
 Min.   : 1.130   Min.   : 1.000   Min.   :187.0   Min.   :12.60  
 1st Qu.: 2.100   1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40  
 Median : 3.207   Median : 5.000   Median :330.0   Median :19.05  
 Mean   : 3.795   Mean   : 9.549   Mean   :408.2   Mean   :18.46  
 3rd Qu.: 5.188   3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20  
 Max.   :12.127   Max.   :24.000   Max.   :711.0   Max.   :22.00  
     black            lstat            medv      
 Min.   :  0.32   Min.   : 1.73   Min.   : 5.00  
 1st Qu.:375.38   1st Qu.: 6.95   1st Qu.:17.02  
 Median :391.44   Median :11.36   Median :21.20  
 Mean   :356.67   Mean   :12.65   Mean   :22.53  
 3rd Qu.:396.23   3rd Qu.:16.95   3rd Qu.:25.00  
 Max.   :396.90   Max.   :37.97   Max.   :50.00  

Correlation coefficients

In [5]:
cor(df)
                   X         crim           zn        indus         chas          nox           rm          age          dis          rad          tax      ptratio        black        lstat         medv
X        1.000000000   0.40740717  -0.10339336   0.39943885 -0.003759115   0.39873617  -0.07997115   0.20378351  -0.30221096  0.686001976   0.66662592    0.2910742  -0.29504123    0.2584648   -0.2266036
crim     0.407407172   1.00000000  -0.20046922   0.40658341 -0.055891582   0.42097171  -0.21924670   0.35273425  -0.37967009  0.625505145   0.58276431    0.2899456  -0.38506394    0.4556215   -0.3883046
zn      -0.103393357  -0.20046922   1.00000000  -0.53382819 -0.042696719  -0.51660371   0.31199059  -0.56953734   0.66440822 -0.311947826  -0.31456332   -0.3916785   0.17552032   -0.4129946    0.3604453
indus    0.399438850   0.40658341  -0.53382819   1.00000000  0.062938027   0.76365145  -0.39167585   0.64477851  -0.70802699  0.595129275   0.72076018    0.3832476  -0.35697654    0.6037997   -0.4837252
chas    -0.003759115  -0.05589158  -0.04269672   0.06293803  1.000000000   0.09120281   0.09125123   0.08651777  -0.09917578 -0.007368241  -0.03558652   -0.1215152   0.04878848   -0.0539293    0.1752602
nox      0.398736174   0.42097171  -0.51660371   0.76365145  0.091202807   1.00000000  -0.30218819   0.73147010  -0.76923011  0.611440563   0.66802320    0.1889327  -0.38005064    0.5908789   -0.4273208
rm      -0.079971150  -0.21924670   0.31199059  -0.39167585  0.091251225  -0.30218819   1.00000000  -0.24026493   0.20524621 -0.209846668  -0.29204783   -0.3555015   0.12806864   -0.6138083    0.6953599
age      0.203783510   0.35273425  -0.56953734   0.64477851  0.086517774   0.73147010  -0.24026493   1.00000000  -0.74788054  0.456022452   0.50645559    0.2615150  -0.27353398    0.6023385   -0.3769546
dis     -0.302210959  -0.37967009   0.66440822  -0.70802699 -0.099175780  -0.76923011   0.20524621  -0.74788054   1.00000000 -0.494587930  -0.53443158   -0.2324705   0.29151167   -0.4969958    0.2499287
rad      0.686001976   0.62550515  -0.31194783   0.59512927 -0.007368241   0.61144056  -0.20984667   0.45602245  -0.49458793  1.000000000   0.91022819    0.4647412  -0.44441282    0.4886763   -0.3816262
tax      0.666625924   0.58276431  -0.31456332   0.72076018 -0.035586518   0.66802320  -0.29204783   0.50645559  -0.53443158  0.910228189   1.00000000    0.4608530  -0.44180801    0.5439934   -0.4685359
ptratio  0.291074227   0.28994558  -0.39167855   0.38324756 -0.121515174   0.18893268  -0.35550149   0.26151501  -0.23247054  0.464741179   0.46085304    1.0000000  -0.17738330    0.3740443   -0.5077867
black   -0.295041232  -0.38506394   0.17552032  -0.35697654  0.048788485  -0.38005064   0.12806864  -0.27353398   0.29151167 -0.444412816  -0.44180801   -0.1773833   1.00000000   -0.3660869    0.3334608
lstat    0.258464770   0.45562148  -0.41299457   0.60379972 -0.053929298   0.59087892  -0.61380827   0.60233853  -0.49699583  0.488676335   0.54399341    0.3740443  -0.36608690    1.0000000   -0.7376627
medv    -0.226603643  -0.38830461   0.36044534  -0.48372516  0.175260177  -0.42732077   0.69535995  -0.37695457   0.24992873 -0.381626231  -0.46853593   -0.5077867   0.33346082   -0.7376627    1.0000000
In [6]:
cor(df["rm"], df["medv"])
        medv
rm 0.6953599
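
Picking rm out of the full matrix by eye is error-prone, so it can help to rank every predictor by the strength of its linear association with medv. The sketch below assumes the MASS package is installed and uses its copy of the Boston data. The names cor_medv and ranked are illustrative.

```r
library(MASS)  # assumption: MASS is installed

# Correlation of every column with medv, excluding medv itself
cor_medv <- cor(Boston)[, "medv"]
cor_medv <- cor_medv[names(cor_medv) != "medv"]

# Strongest linear associations first (the sign is kept in the printed values)
ranked <- cor_medv[order(abs(cor_medv), decreasing = TRUE)]
round(head(ranked, 3), 3)
```

Going by the matrix above, the three strongest associations are lstat (-0.738), rm (0.695) and ptratio (-0.508), so rm is the strongest positive one.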

Plotting

In [7]:
plot(df)
In [8]:
hist(df$crim, breaks = 50,
     main = "Histogram of Crime Rate",
     xlab = "Per capita crime rate by town",
     ylab = "Frequency")
In [9]:
plot(df[c("zn", "indus")], xlab = "zn", ylab = "indus")
In [10]:
plot(df[c("rm", "medv")], xlab = "rm", ylab = "medv")
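
The summary above shows that crim is heavily right-skewed (median 0.26, maximum 88.98), so a histogram on the raw scale puts most towns into the first few bins. One possible alternative, sketched here with the MASS copy of the data, is to plot the distribution on a log10 scale. The choice of transform is an assumption of this sketch, not something the original notebook does.

```r
library(MASS)  # assumption: MASS is installed

# crim is strictly positive (minimum 0.00632), so log10 is safe here
h <- hist(log10(Boston$crim), breaks = 50, plot = FALSE)

plot(h,
     main = "Histogram of log10 crime rate",
     xlab = "log10(per capita crime rate by town)",
     ylab = "Frequency")
```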

Linear Regression

In [11]:
x <- data.matrix(df["rm"])
y <- data.matrix(df["medv"])
In [12]:
fit <- lm(y~x)
In [13]:
summary(fit)
Call:
lm(formula = y ~ x)

Residuals:
    Min      1Q  Median      3Q     Max 
-23.346  -2.547   0.090   2.986  39.433 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -34.671      2.650  -13.08   <2e-16 ***
x              9.102      0.419   21.72   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 6.616 on 504 degrees of freedom
Multiple R-squared:  0.4835,	Adjusted R-squared:  0.4825 
F-statistic: 471.8 on 1 and 504 DF,  p-value: < 2.2e-16
In [14]:
plot(df[c("rm", "medv")], xlab = "rm", ylab = "medv")
abline(fit)
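
The same one-predictor model can also be written with the formula interface and a data argument, which keeps the column names in the output and makes prediction on new values straightforward. The sketch below assumes the MASS package is installed. The name fit_rm and the value rm = 6 are illustrative.

```r
library(MASS)  # assumption: MASS is installed

# Same model as above, but with named columns and a data argument
fit_rm <- lm(medv ~ rm, data = Boston)

# For a single predictor, R-squared is the squared correlation coefficient
r2 <- summary(fit_rm)$r.squared
all.equal(r2, cor(Boston$rm, Boston$medv)^2)  # TRUE

# Predicted median home value (in $1000s) for a hypothetical average of 6 rooms
predict(fit_rm, newdata = data.frame(rm = 6))
```

This ties the two sections together: squaring the correlation 0.6953599 reported earlier gives the Multiple R-squared of 0.4835 in the summary.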

Multiple Regression Analysis

In [15]:
crim <- data.matrix(df["crim"])
zn <- data.matrix(df["zn"])
indus <- data.matrix(df["indus"])
chas <- data.matrix(df["chas"])
nox <- data.matrix(df["nox"])
rm <- data.matrix(df["rm"])
age <- data.matrix(df["age"])
dis <- data.matrix(df["dis"])
rad <- data.matrix(df["rad"])
tax <- data.matrix(df["tax"])
ptratio <- data.matrix(df["ptratio"])
black <- data.matrix(df["black"])
lstat <- data.matrix(df["lstat"])
medv <- data.matrix(df["medv"])

y <- medv
In [16]:
fit <- lm(y ~ crim + zn + indus + chas + nox + rm + age + dis + rad + tax + ptratio + black + lstat)
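
Extracting each column into its own matrix works, but it creates fourteen global variables, one of which (rm) shares its name with the base function rm(). A more compact way to specify the same model, sketched here under the assumption that the MASS package is installed, is to pass the data frame directly and let "." stand for all remaining columns. With the CSV version of the data, the index column would have to be excluded as well, for example with medv ~ . - X.

```r
library(MASS)  # assumption: MASS is installed

# "." expands to every column of Boston other than the response medv
fit_full <- lm(medv ~ ., data = Boston)

length(coef(fit_full))  # intercept plus 13 predictors
deviance(fit_full)      # residual sum of squares of the full model
```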

Step-wise Selection

In [17]:
result <- step(fit)
Start:  AIC=1589.64
y ~ crim + zn + indus + chas + nox + rm + age + dis + rad + tax + 
    ptratio + black + lstat

          Df Sum of Sq   RSS    AIC
- age      1      0.06 11079 1587.7
- indus    1      2.52 11081 1587.8
<none>                 11079 1589.6
- chas     1    218.97 11298 1597.5
- tax      1    242.26 11321 1598.6
- crim     1    243.22 11322 1598.6
- zn       1    257.49 11336 1599.3
- black    1    270.63 11349 1599.8
- rad      1    479.15 11558 1609.1
- nox      1    487.16 11566 1609.4
- ptratio  1   1194.23 12273 1639.4
- dis      1   1232.41 12311 1641.0
- rm       1   1871.32 12950 1666.6
- lstat    1   2410.84 13490 1687.3

Step:  AIC=1587.65
y ~ crim + zn + indus + chas + nox + rm + dis + rad + tax + ptratio + 
    black + lstat

          Df Sum of Sq   RSS    AIC
- indus    1      2.52 11081 1585.8
<none>                 11079 1587.7
- chas     1    219.91 11299 1595.6
- tax      1    242.24 11321 1596.6
- crim     1    243.20 11322 1596.6
- zn       1    260.32 11339 1597.4
- black    1    272.26 11351 1597.9
- rad      1    481.09 11560 1607.2
- nox      1    520.87 11600 1608.9
- ptratio  1   1200.23 12279 1637.7
- dis      1   1352.26 12431 1643.9
- rm       1   1959.55 13038 1668.0
- lstat    1   2718.88 13798 1696.7

Step:  AIC=1585.76
y ~ crim + zn + chas + nox + rm + dis + rad + tax + ptratio + 
    black + lstat

          Df Sum of Sq   RSS    AIC
<none>                 11081 1585.8
- chas     1    227.21 11309 1594.0
- crim     1    245.37 11327 1594.8
- zn       1    257.82 11339 1595.4
- black    1    270.82 11352 1596.0
- tax      1    273.62 11355 1596.1
- rad      1    500.92 11582 1606.1
- nox      1    541.91 11623 1607.9
- ptratio  1   1206.45 12288 1636.0
- dis      1   1448.94 12530 1645.9
- rm       1   1963.66 13045 1666.3
- lstat    1   2723.48 13805 1695.0
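
The AIC values in this trace come from extractAIC(), which for a linear model computes n * log(RSS / n) + 2 * edf, where edf is the number of estimated coefficients. The sketch below recomputes the starting value by hand, assuming the MASS package is installed. The names fit_full and aic_manual are illustrative.

```r
library(MASS)  # assumption: MASS is installed

fit_full <- lm(medv ~ ., data = Boston)

n   <- nrow(Boston)            # 506 observations
rss <- deviance(fit_full)      # residual sum of squares
edf <- length(coef(fit_full))  # 14 estimated coefficients

aic_manual <- n * log(rss / n) + 2 * edf
c(manual = aic_manual, extractAIC = extractAIC(fit_full)[2])
```

Both numbers should agree with the starting AIC of 1589.64 in the trace. They differ from what AIC() reports, because AIC() keeps the additive constant of the Gaussian log-likelihood; only differences between models matter for the selection.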
In [18]:
summary(result)
Call:
lm(formula = y ~ crim + zn + chas + nox + rm + dis + rad + tax + 
    ptratio + black + lstat)

Residuals:
     Min       1Q   Median       3Q      Max 
-15.5984  -2.7386  -0.5046   1.7273  26.2373 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  36.341145   5.067492   7.171 2.73e-12 ***
crim         -0.108413   0.032779  -3.307 0.001010 ** 
zn            0.045845   0.013523   3.390 0.000754 ***
chas          2.718716   0.854240   3.183 0.001551 ** 
nox         -17.376023   3.535243  -4.915 1.21e-06 ***
rm            3.801579   0.406316   9.356  < 2e-16 ***
dis          -1.492711   0.185731  -8.037 6.84e-15 ***
rad           0.299608   0.063402   4.726 3.00e-06 ***
tax          -0.011778   0.003372  -3.493 0.000521 ***
ptratio      -0.946525   0.129066  -7.334 9.24e-13 ***
black         0.009291   0.002674   3.475 0.000557 ***
lstat        -0.522553   0.047424 -11.019  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4.736 on 494 degrees of freedom
Multiple R-squared:  0.7406,	Adjusted R-squared:  0.7348 
F-statistic: 128.2 on 11 and 494 DF,  p-value: < 2.2e-16
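
To see at a glance what the selection changed, the terms of the full and the reduced model can be compared directly. This is a minimal sketch, assuming the MASS package is installed. The names fit_full, fit_step, full_terms and step_terms are illustrative.

```r
library(MASS)  # assumption: MASS is installed

fit_full <- lm(medv ~ ., data = Boston)
fit_step <- step(fit_full, trace = 0)  # trace = 0 suppresses the step log

full_terms <- attr(terms(fit_full), "term.labels")
step_terms <- attr(terms(fit_step), "term.labels")
setdiff(full_terms, step_terms)        # predictors removed by step()

# Adjusted R-squared before and after selection
c(full = summary(fit_full)$adj.r.squared,
  step = summary(fit_step)$adj.r.squared)
```

In line with the trace above, the removed predictors should be age and indus, and the adjusted R-squared of the reduced model should match the 0.7348 in the summary.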
