> library(alr4)
## 载入需要的程辑包:car
## 载入需要的程辑包:carData
## 载入需要的程辑包:effects
## lattice theme set by effectsTheme()
## See ?effectsTheme for details.
> library(MASS)
> myfit=lm(BrainWt~BodyWt,data=brains)
> plot(myfit,which=1) #residual plot
拟合效果不好,残差非线性且方差不齐。
> boxcox(BrainWt~BodyWt,data=brains)
\(\lambda \approx 0\), 对响应BrainWt作log变换。
> myfit=lm(log(BrainWt)~BodyWt,data=brains)
> plot(myfit,which=1) #residual plot
> boxcox(BodyWt~log(BrainWt),data=brains)
\(\lambda \approx 0\), 对自变量BodyWt作log变换。继续残差分析。
> myfit=lm(log(BrainWt)~log(BodyWt),data=brains)
> plot(myfit,which=1) #residual plot
这张残差图表明残差接近正态分布,误差方差是常数,拟合效果很好。
> se=read.table("salary-experience.txt", head=T,row.names=1)
> se=se[,2:1]
> plot(se)
> lowess(se,f=2/3)->lowess.fit
> lines(lowess.fit)
> y=se[,"Salary"]
> myfit=lm(Salary~Experience,data=se)
> y.hat=fitted(myfit)
> plot(y,y.hat)
> lowess(y,y.hat,f=2/3)->lowess.fit
> lines(lowess.fit)
> edu=read.table("edu.xls", head=T,row.names=1)
> fit1=lm(Expenditure~.,data=edu)
> par(mfrow=c(2,3))
> plot(fit1,which=1:6)
> influence.measures(fit1)$infmat[49:50,]
## dfb.1_ dfb.Incm dfb.Yong dfb.Urbn dfb.Regn dffit
## AK -2.3120010 2.31477633 1.89877974 -1.57238898 0.4210676 3.2300799
## HI 0.1084908 -0.09595102 -0.03194678 -0.08114159 -0.1686432 -0.3092706
## cov.r cook.d hat
## AK 0.5682647 1.65390419 0.44954923
## HI 1.1069302 0.01915115 0.09151557
> boxcox(Expenditure~.,data=edu) #lambda=0
> fit2=lm(log(Expenditure)~log(Income)+Young+Urban,data=edu[-49,])
> par(mfrow=c(2,3))
> plot(fit2,1:6)
IRLS算法:
> fit.ini=fit=lm(log(Expenditure)~.,data=edu[-49,])
> repeat{
+ beta=coef(fit)
+ res=resid(fit)
+ sigmasq1=sum(res[1:9]^2)/9
+ sigmasq2=sum(res[10:21]^2)/12
+ sigmasq3=sum(res[22:37]^2)/16
+ sigmasq4=sum(res[38:49]^2)/12
+ sigma2=c(rep(sigmasq1,9),rep(sigmasq2,12),rep(sigmasq3,16),rep(sigmasq4,12))
+ w=1/sigma2
+ fit=lm(log(Expenditure)~log(Income)+Young+Urban,data=edu[-49,],weights=w)
+ beta.new=coef(fit)
+ beta.new
+ delta=sum(abs(beta.new-beta))
+ print(delta)
+ if(delta<1e-10) break
+ beta=beta.new
+ }
## [1] 14.05909
## [1] 0.9162789
## [1] 0.206174
## [1] 0.02831151
## [1] 0.003514726
## [1] 0.0004364649
## [1] 5.438016e-05
## [1] 6.781829e-06
## [1] 8.459452e-07
## [1] 1.055248e-07
## [1] 1.316345e-08
## [1] 1.642057e-09
## [1] 2.048187e-10
## [1] 2.556389e-11
> fit3=fit #final fit
> unique(sigma2)
## [1] 0.017247868 0.029783743 0.001197917 0.017369004
> y.hat=fitted(fit3)
> res=resid(fit3)
> x=sqrt(w)*y.hat
> y=sqrt(w)*res
> plot(x,y)
没有方差不齐现象。
> p=6;n=51
> myfit=lm(FuelC~.,data=fuel2001)
> par(mfrow=c(2,3))
> plot(myfit,which=1:6)
图1:TX,FL,NY 残差异常(响应变量异常);
图2:除去TX, FL, NY 后误差基本服从正态分布;
图3:方差不齐,有较明显的非线性趋势;
图4:TX, FL, NY Cook距离很大,都是高影响点;
图5:TX, FL, NY Cook距离大于1,其残差较大,leverage也较大,响应变量和自变量都异常,是高影响点;
图6:除去TX,FL,NY 和另一个点,都在等高线1下面,说明标准化残差小于等于 \(\sqrt{6}\).此外,TX,FL,NY 的Cook距离 D 和 leverage \(h_{ii}\) 都较大,高影响。
计算杠杆值:
> lev=influence.measures(myfit)$infmat[,11] #杠杆值hii
> lev[which(lev>3*p/n)]
## CA FL NY TX WY
## 0.5464727 0.5404181 0.4722729 0.4044865 0.3893685
可见除了刚才提到的 TX,FL,NY三个高影响点,CA的杠杆值大,自变量异常,其Cook距离也很大,因此也算是高影响点。
去掉这四个点以后重新拟合:
> myfit1=lm(FuelC~.,data=fuel2001[-c(5,10,33,44),])
> par(mfrow=c(2,3))
> plot(myfit1,which=1:6)
对因变量作 Box-Cox 变换:
> boxcox(FuelC~.,data=fuel2001[-c(5,10,33,44),])
\(\lambda=1\), 不需要变换。重新拟合:
> summary(myfit1)
##
## Call:
## lm(formula = FuelC ~ ., data = fuel2001[-c(5, 10, 33, 44), ])
##
## Residuals:
## Min 1Q Median 3Q Max
## -365823 -114949 -11493 87408 401598
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.483e+05 4.013e+05 0.619 0.539577
## Drivers 5.411e-01 1.499e-01 3.610 0.000844 ***
## Income -3.099e+00 8.094e+00 -0.383 0.703849
## Miles 1.670e+00 1.001e+00 1.668 0.103192
## MPC 2.888e+01 1.762e+01 1.639 0.109015
## Pop 9.968e-02 1.322e-01 0.754 0.455357
## Tax -2.375e+04 6.327e+03 -3.754 0.000554 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 188200 on 40 degrees of freedom
## Multiple R-squared: 0.9859, Adjusted R-squared: 0.9838
## F-statistic: 466.2 on 6 and 40 DF, p-value: < 2.2e-16
Tax 的系数估计值为负,且 \(p=0.000554\) 很小,是显著变量,原问题的结论是汽油周税高的州,汽油消耗较低。
相关分析:
> library(corrplot)
## corrplot 0.92 loaded
> corrplot(cor(cloud),diag=F)
发现多个变量与Rain有强相关性,且A = 1的 12 天与 A = 0 的 12 天之间在其它因素上可能还是有系统性差异的。
从中也可以看出 S (在随机化之前直接决定 A 的量)与 D 等变量强相关。
> fit=lm(Rain~.,data=cloud)
> summary(fit)
##
## Call:
## lm(formula = Rain ~ ., data = cloud)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.1820 -1.0170 -0.2685 0.7193 6.8074
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.654365 3.337174 1.395 0.181
## A 1.013201 1.203337 0.842 0.411
## D -0.032125 0.028921 -1.111 0.282
## S -0.910860 0.751204 -1.213 0.242
## C 0.005654 0.114899 0.049 0.961
## P 1.843790 2.758034 0.669 0.513
## E 2.167987 1.579297 1.373 0.188
##
## Residual standard error: 2.836 on 17 degrees of freedom
## Multiple R-squared: 0.3849, Adjusted R-squared: 0.1678
## F-statistic: 1.773 on 6 and 17 DF, p-value: 0.1647
> par(mfrow=c(2,3))
> plot(fit,which=1:6)
利用Cook距离判断2是异常点,自变量和因变量均异常,去掉之后进行拟合。首先考虑对因变量的Box-Cox变换:
> boxcox(Rain~.,data=cloud[-2,]) #lambda=0.5
作\(\lambda=0.5\) 的变换后,服从正态模型。
> fit1=lm(2*(Rain^0.5-1)~.,data=cloud[-2,])#Box-Cox变换后拟合
> summary(fit1)
##
## Call:
## lm(formula = 2 * (Rain^0.5 - 1) ~ ., data = cloud[-2, ])
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.6487 -0.3921 0.1080 0.5071 2.2577
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.53328 1.44132 0.370 0.7162
## A 1.19638 0.53483 2.237 0.0399 *
## D -0.01870 0.01192 -1.569 0.1363
## S -0.39317 0.30998 -1.268 0.2228
## C 0.16402 0.08910 1.841 0.0843 .
## P 1.90782 1.15314 1.654 0.1175
## E 0.99412 0.65374 1.521 0.1479
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.166 on 16 degrees of freedom
## Multiple R-squared: 0.5876, Adjusted R-squared: 0.4329
## F-statistic: 3.799 on 6 and 16 DF, p-value: 0.01517
> par(mfrow=c(2,3))
> plot(fit1,which=1:6)
由此可见 A 对 Rain 有显著影响,且分析残差图可知此时模型拟合效果好,方差齐,符合线性的误差正态模型。
因此,结论为 A 与 Rain 存在显著的因果关系。