R 程序包 alr4 中有一个数据集 Heights, 是 Karl Pearson 收集的 1100 多个家庭的母女身高数据,其中每个家庭至多 2 个女儿, 女儿的年龄在 18 岁以上, 母亲年龄小于 65 岁. 我们感兴趣的是母亲身高Mheight 对女儿身高 Dheight 的影响
> install.packages("alr4") # 安装
> library(alr4)
## 载入需要的程辑包:car
## 载入需要的程辑包:carData
## 载入需要的程辑包:effects
## lattice theme set by effectsTheme()
## See ?effectsTheme for details.
> head(Heights,5) #查看前5行数据
## mheight dheight
## 1 59.7 55.1
## 2 58.2 56.5
## 3 60.6 56.0
## 4 60.7 56.8
## 5 61.8 56.0
> myfit=lm(dheight~mheight,data=Heights)
> myfit
##
## Call:
## lm(formula = dheight ~ mheight, data = Heights)
##
## Coefficients:
## (Intercept) mheight
## 29.9174 0.5417
> summary(myfit) #details
##
## Call:
## lm(formula = dheight ~ mheight, data = Heights)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.397 -1.529 0.036 1.492 9.053
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 29.91744 1.62247 18.44 <2e-16 ***
## mheight 0.54175 0.02596 20.87 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.266 on 1373 degrees of freedom
## Multiple R-squared: 0.2408, Adjusted R-squared: 0.2402
## F-statistic: 435.5 on 1 and 1373 DF, p-value: < 2.2e-16
> coefficients(myfit)
## (Intercept) mheight
## 29.917437 0.541747
通过二维散点图查看拟合效果:
> plot(Heights)
> abline(myfit,col="red")
> plot(myfit)
alr4 包中的数据集 brains 给出了 62 种哺乳动物的脑重量 (g) 和体重数据 (kg), 我们希望了解脑重(响应) 与体重(自变量)的关系。
> plot(brains[,2:1])
> a=lm(BrainWt~BodyWt,data=brains)
> abline(a,col="red")#拟合效果不好
> plot(log(brains[,2:1]))
> b=lm(BrainWt~BodyWt,data=log(brains))
> abline(b,col="red")#拟合效果很好
查看体重在4kg附近的几种动物数据,显然第二个模型预测更准确一些。
> brains[brains[,2]<4.5 & brains[,2]>3.5,]
## BrainWt BodyWt
## Vervet 57.998 4.190
## Yellow-bellied marmot 17.000 4.050
## Rock hyrax2 21.000 3.600
## Raccoon 39.201 4.288
## Red fox 50.400 4.235
研究福布斯财富与排名的关系。 导入数据框:
> forbes <- read.table(file="forbes2019.txt",header=T,sep="\t",fileEncoding="UTF-8")
> names(forbes)=c("Rank","name_ch","name_en","Wealth","Wealth_source")
直接对Wealth与Rank做线性拟合,得到的效果不好。
> plot(forbes[,c("Rank","Wealth")])
> myfit1=lm(Wealth~Rank,data=forbes)
> abline(myfit1,col="red")#拟合效果不好
分析线性拟合情况:
> summary(myfit1)
##
## Call:
## lm(formula = Wealth ~ Rank, data = forbes)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.652 -7.619 -1.689 4.265 76.411
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 55.13537 2.46305 22.39 <2e-16 ***
## Rank -0.54635 0.04264 -12.81 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.21 on 98 degrees of freedom
## Multiple R-squared: 0.6262, Adjusted R-squared: 0.6224
## F-statistic: 164.2 on 1 and 98 DF, p-value: < 2.2e-16
对两个变量取对数:
> plot(log(forbes[,c("Rank","Wealth")]))
> myfit2=lm(Wealth~Rank,data=log(forbes[,c("Rank","Wealth")]))
> abline(myfit2,col="red")#拟合效果很好
> summary(myfit2)
##
## Call:
## lm(formula = Wealth ~ Rank, data = log(forbes[, c("Rank", "Wealth")]))
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.34423 -0.03443 -0.02314 0.00228 0.21810
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.219425 0.036374 143.49 <2e-16 ***
## Rank -0.568528 0.009708 -58.56 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.08949 on 98 degrees of freedom
## Multiple R-squared: 0.9722, Adjusted R-squared: 0.9719
## F-statistic: 3430 on 1 and 98 DF, p-value: < 2.2e-16
可见,在使用对数拟合后,决定系数\(R^2\)明显上升,拟合效果提升很多。 所以,最终的模型为: \[ \log(Wealth)=5.219425-0.568528 \times \log(Rank). \] 也就是: \[ Wealth=e^{5.219425} \times Rank^{-0.568528}. \]
导入成年人身高-体重数据:
> wh <- read.table(file="height-weight.txt",header=T,sep="\t",fileEncoding="UTF-8")
查看前5行:
> head(wh,5)
## sex weight height
## 1 1 77 1.82
## 2 0 58 1.61
## 3 0 53 1.61
## 4 1 68 1.77
## 5 0 59 1.57
(a)求LS估计 使用所有数据估计:
> myfit3=lm(weight~height,data=log(wh))
> summary(myfit3)
##
## Call:
## lm(formula = weight ~ height, data = log(wh))
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.27563 -0.07883 -0.00480 0.07677 0.45833
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.59878 0.08843 29.39 <2e-16 ***
## height 2.92966 0.16521 17.73 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1216 on 197 degrees of freedom
## Multiple R-squared: 0.6148, Adjusted R-squared: 0.6129
## F-statistic: 314.4 on 1 and 197 DF, p-value: < 2.2e-16
男性:
> wh_male=wh[wh["sex"]==1,]
> myfit_male=lm(weight~height,data=log(wh_male))
> summary(myfit_male)
##
## Call:
## lm(formula = weight ~ height, data = log(wh_male))
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.29310 -0.08979 -0.00310 0.08148 0.43410
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.9825 0.2186 13.64 < 2e-16 ***
## height 2.3181 0.3787 6.12 2.69e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1277 on 86 degrees of freedom
## Multiple R-squared: 0.3034, Adjusted R-squared: 0.2953
## F-statistic: 37.46 on 1 and 86 DF, p-value: 2.689e-08
女性:
> wh_female=wh[wh["sex"]==0,]
> myfit_female=lm(weight~height,data=log(wh_female))
> summary(myfit_female)
##
## Call:
## lm(formula = weight ~ height, data = log(wh_female))
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.28367 -0.05921 -0.01299 0.06503 0.31279
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.1203 0.1413 22.08 < 2e-16 ***
## height 1.8333 0.2829 6.48 2.74e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.103 on 109 degrees of freedom
## Multiple R-squared: 0.2781, Adjusted R-squared: 0.2715
## F-statistic: 41.99 on 1 and 109 DF, p-value: 2.743e-09
于是,所有数据: \[ \hat{a}=2.59878,\hat{b}=2.92966,\hat{\sigma}=0.1216. \] 男性: \[ \hat{a}=2.9825,\hat{b}=2.3181,\hat{\sigma}=0.1277. \] 女性: \[ \hat{a}=3.1203,\hat{b}=1.8333,\hat{\sigma}=0.103. \] 对于男性,公式(1)可以表示为: \[ \frac{W}{H^{2.3181}} > \exp\{2.9825+1.645\times 0.1277\}=24.35083. \] 代入我的身高体重\(H=1.738,W=69.1\), 左边=\(\frac{69.1}{1.738^{2.3181}}\)=19.18751<24.35083, 所以我没有超重。 \[ BMI=\frac{W}{H^2}=\frac{69.1}{1.738^2}=22.87592. \] 计算出的BMI在18.5-25这一正常范围内,故判别结果相同。
(b)假设检验b=2,则检验的p值为
> n=dim(wh)[1] # n=199
> wh_log=log(wh)
> height=wh_log["height"]
> s_xx=(n-1)*var(height)
> t=(2.92966-2)*sqrt(s_xx)/0.1216
> p=1-pt(t,n-2)
> print(p[1,1])
## [1] 3.090785e-08
显著,所以我们不能认为得到了BMI计算公式。
(c)控制性别
> wh_c=wh_log
> wh_c[wh_c["sex"]==0,]["sex"]<-1
> wh_c[wh_c["sex"]<0,]["sex"]<-0
> myfit_c=lm(weight~height+sex,data=wh_c)
> coefficients(myfit_c)
## (Intercept) height sex
## 3.0087053 2.0571556 0.1240784
这时得到的\(\hat{b}=2.0571556\),能推断出\(b=2\).
先写一个计算首位数字的函数calfirstnum:
> calfirstnum<-function(x){
+ #x>0
+ if(x>1){
+ if(x<10){
+ return(floor(x))
+ }
+ else{
+ return(calfirstnum(x/10))
+ }
+ }
+ else{
+ return(calfirstnum(10*x))
+ }
+ }
> y=rnorm(10000)
> x=exp(y)
> count=numeric(9)
> for (i in 1:10000){
+ for(j in 1:9){
+ if (calfirstnum(x[i])==j) count[j]<-count[j]+1
+ }
+ }
> count_norm=count/10000
> a=seq(0,9,0.1)
> f=log10(1+1/a)
> plot(a,f,type="l", lwd=2,col="red",main="f=log10(1+1/x)")
> points(1:9,count_norm)
可以看出模拟数据对应的散点落在函数曲线上!所以论断成立。
如果把方差换成其他数,例如取\(\sigma=20\).发现结论依然成立。
> y=rnorm(10000,0,20)
> x=exp(y)
> count=numeric(9)
> for (i in 1:10000){
+ for(j in 1:9){
+ if (calfirstnum(x[i])==j) count[j]<-count[j]+1
+ }
+ }
> count_norm=count/10000
> a=seq(0,9,0.1)
> f=log10(1+1/a)
> plot(a,f,type="l", lwd=2,col="red",main="f=log10(1+1/x)")
> points(1:9,count_norm)