1 残差分析及 Box-Cox 变换

1.1 例1

> library(alr4)
## 载入需要的程辑包:car
## 载入需要的程辑包:carData
## 载入需要的程辑包:effects
## lattice theme set by effectsTheme()
## See ?effectsTheme for details.
> library(MASS)
> myfit=lm(BrainWt~BodyWt,data=brains)
> plot(myfit,which=1) #residual plot

拟合效果不好,残差非线性且方差不齐。

> boxcox(BrainWt~BodyWt,data=brains)

\(\lambda \approx 0\), 对响应BrainWt作log变换。

> myfit=lm(log(BrainWt)~BodyWt,data=brains)
> plot(myfit,which=1) #residual plot

> boxcox(BodyWt~log(BrainWt),data=brains)

\(\lambda \approx 0\), 对自变量BodyWt作log变换。继续残差分析。

> myfit=lm(log(BrainWt)~log(BodyWt),data=brains)
> plot(myfit,which=1) #residual plot

这张残差图表明残差接近正态分布,误差方差是常数,拟合效果很好。

2 简介:探索非线性变换的两种方法 - lowess, IRP

2.1 LOWESS

> se=read.table("salary-experience.txt", head=T,row.names=1)
> se=se[,2:1]
> plot(se)
> lowess(se,f=2/3)->lowess.fit
> lines(lowess.fit)

2.2 IRP

> y=se[,"Salary"]
> myfit=lm(Salary~Experience,data=se)
> y.hat=fitted(myfit)
> plot(y,y.hat)
> lowess(y,y.hat,f=2/3)->lowess.fit
> lines(lowess.fit)

3 回归诊断:残差分析和影响分析

3.1 例2

> edu=read.table("edu.xls", head=T,row.names=1)
> fit1=lm(Expenditure~.,data=edu)
> par(mfrow=c(2,3))
> plot(fit1,which=1:6)

> influence.measures(fit1)$infmat[49:50,]
##        dfb.1_    dfb.Incm    dfb.Yong    dfb.Urbn   dfb.Regn      dffit
## AK -2.3120010  2.31477633  1.89877974 -1.57238898  0.4210676  3.2300799
## HI  0.1084908 -0.09595102 -0.03194678 -0.08114159 -0.1686432 -0.3092706
##        cov.r     cook.d        hat
## AK 0.5682647 1.65390419 0.44954923
## HI 1.1069302 0.01915115 0.09151557
> boxcox(Expenditure~.,data=edu) #lambda=0

> fit2=lm(log(Expenditure)~log(Income)+Young+Urban,data=edu[-49,])
> par(mfrow=c(2,3))
> plot(fit2,1:6)

IRLS算法:

> fit.ini=fit=lm(log(Expenditure)~.,data=edu[-49,])
> repeat{
+   beta=coef(fit)
+   res=resid(fit)
+   sigmasq1=sum(res[1:9]^2)/9
+   sigmasq2=sum(res[10:21]^2)/12
+   sigmasq3=sum(res[22:37]^2)/16
+   sigmasq4=sum(res[38:49]^2)/12
+   sigma2=c(rep(sigmasq1,9),rep(sigmasq2,12),rep(sigmasq3,16),rep(sigmasq4,12))
+   w=1/sigma2
+   fit=lm(log(Expenditure)~log(Income)+Young+Urban,data=edu[-49,],weights=w)
+   beta.new=coef(fit)
+   beta.new
+   delta=sum(abs(beta.new-beta))
+   print(delta)
+   if(delta<1e-10) break
+   beta=beta.new
+ }
## [1] 14.05909
## [1] 0.9162789
## [1] 0.206174
## [1] 0.02831151
## [1] 0.003514726
## [1] 0.0004364649
## [1] 5.438016e-05
## [1] 6.781829e-06
## [1] 8.459452e-07
## [1] 1.055248e-07
## [1] 1.316345e-08
## [1] 1.642057e-09
## [1] 2.048187e-10
## [1] 2.556389e-11
> fit3=fit #final fit
> unique(sigma2)
## [1] 0.017247868 0.029783743 0.001197917 0.017369004

4 练习题

4.1 练习1

> y.hat=fitted(fit3)
> res=resid(fit3)
> x=sqrt(w)*y.hat
> y=sqrt(w)*res
> plot(x,y)

没有方差不齐现象。

4.2 练习2

> p=6;n=51
> myfit=lm(FuelC~.,data=fuel2001)
> par(mfrow=c(2,3))
> plot(myfit,which=1:6)

图1:TX,FL,NY 残差异常(响应变量异常);

图2:除去TX, FL, NY 后误差基本服从正态分布;

图3:方差不齐,有较明显的非线性趋势;

图4:TX, FL, NY Cook距离很大,都是高影响点;

图5:TX, FL, NY Cook距离大于1,其残差较大,leverage也较大,响应变量和自变量都异常,是高影响点;

图6:除去TX,FL,NY 和另一个点,都在等高线1下面,说明标准化残差小于等于 \(\sqrt{6}\).此外,TX,FL,NY 的Cook距离 D 和 leverage \(h_{ii}\) 都较大,高影响。

计算杠杆值:

> lev=influence.measures(myfit)$infmat[,11] #杠杆值hii
> lev[which(lev>3*p/n)]
##        CA        FL        NY        TX        WY 
## 0.5464727 0.5404181 0.4722729 0.4044865 0.3893685

可见除了刚才提到的 TX,FL,NY三个高影响点,CA的杠杆值大,自变量异常,其Cook距离也很大,因此也算是高影响点。

去掉这四个点以后重新拟合:

> myfit1=lm(FuelC~.,data=fuel2001[-c(5,10,33,44),])
> par(mfrow=c(2,3))
> plot(myfit1,which=1:6)

对因变量作 Box-Cox 变换:

> boxcox(FuelC~.,data=fuel2001[-c(5,10,33,44),])

\(\lambda=1\), 不需要变换。重新拟合:

> summary(myfit1)
## 
## Call:
## lm(formula = FuelC ~ ., data = fuel2001[-c(5, 10, 33, 44), ])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -365823 -114949  -11493   87408  401598 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.483e+05  4.013e+05   0.619 0.539577    
## Drivers      5.411e-01  1.499e-01   3.610 0.000844 ***
## Income      -3.099e+00  8.094e+00  -0.383 0.703849    
## Miles        1.670e+00  1.001e+00   1.668 0.103192    
## MPC          2.888e+01  1.762e+01   1.639 0.109015    
## Pop          9.968e-02  1.322e-01   0.754 0.455357    
## Tax         -2.375e+04  6.327e+03  -3.754 0.000554 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 188200 on 40 degrees of freedom
## Multiple R-squared:  0.9859, Adjusted R-squared:  0.9838 
## F-statistic: 466.2 on 6 and 40 DF,  p-value: < 2.2e-16

Tax 的系数估计值为负,且 \(p=0.000554\) 很小,是显著变量,原问题的结论是汽油周税高的州,汽油消耗较低。

4.3 练习3

相关分析:

> library(corrplot)
## corrplot 0.92 loaded
> corrplot(cor(cloud),diag=F)

发现多个变量与Rain有强相关性,且A = 1的 12 天与 A = 0 的 12 天之间在其它因素上可能还是有系统性差异的。

从中也可以看出 S (在随机化之前直接决定 A 的量)与 D 等变量强相关。

> fit=lm(Rain~.,data=cloud)
> summary(fit)
## 
## Call:
## lm(formula = Rain ~ ., data = cloud)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.1820 -1.0170 -0.2685  0.7193  6.8074 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept)  4.654365   3.337174   1.395    0.181
## A            1.013201   1.203337   0.842    0.411
## D           -0.032125   0.028921  -1.111    0.282
## S           -0.910860   0.751204  -1.213    0.242
## C            0.005654   0.114899   0.049    0.961
## P            1.843790   2.758034   0.669    0.513
## E            2.167987   1.579297   1.373    0.188
## 
## Residual standard error: 2.836 on 17 degrees of freedom
## Multiple R-squared:  0.3849, Adjusted R-squared:  0.1678 
## F-statistic: 1.773 on 6 and 17 DF,  p-value: 0.1647
> par(mfrow=c(2,3))
> plot(fit,which=1:6)

利用Cook距离判断2是异常点,自变量和因变量均异常,去掉之后进行拟合。首先考虑对因变量的Box-Cox变换:

> boxcox(Rain~.,data=cloud[-2,]) #lambda=0.5

\(\lambda=0.5\) 的变换后,服从正态模型。

> fit1=lm(2*(Rain^0.5-1)~.,data=cloud[-2,])#Box-Cox变换后拟合
> summary(fit1)
## 
## Call:
## lm(formula = 2 * (Rain^0.5 - 1) ~ ., data = cloud[-2, ])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.6487 -0.3921  0.1080  0.5071  2.2577 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)  0.53328    1.44132   0.370   0.7162  
## A            1.19638    0.53483   2.237   0.0399 *
## D           -0.01870    0.01192  -1.569   0.1363  
## S           -0.39317    0.30998  -1.268   0.2228  
## C            0.16402    0.08910   1.841   0.0843 .
## P            1.90782    1.15314   1.654   0.1175  
## E            0.99412    0.65374   1.521   0.1479  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.166 on 16 degrees of freedom
## Multiple R-squared:  0.5876, Adjusted R-squared:  0.4329 
## F-statistic: 3.799 on 6 and 16 DF,  p-value: 0.01517
> par(mfrow=c(2,3))
> plot(fit1,which=1:6)

由此可见 A 对 Rain 有显著影响,且分析残差图可知此时模型拟合效果好,方差齐,符合线性的误差正态模型。

因此,结论为 A 与 Rain 存在显著的因果关系。