1 例子

1.1 例1

R 程序包 alr4 中有一个数据集 Heights, 是 Karl Pearson 收集的 1100 多个家庭的母女身高数据,其中每个家庭至多 2 个女儿, 女儿的年龄在 18 岁以上, 母亲年龄小于 65 岁. 我们感兴趣的是母亲身高Mheight 对女儿身高 Dheight 的影响

> install.packages("alr4") # 安装
> library(alr4)
## 载入需要的程辑包:car
## 载入需要的程辑包:carData
## 载入需要的程辑包:effects
## lattice theme set by effectsTheme()
## See ?effectsTheme for details.
> head(Heights,5) #查看前5行数据
##   mheight dheight
## 1    59.7    55.1
## 2    58.2    56.5
## 3    60.6    56.0
## 4    60.7    56.8
## 5    61.8    56.0
> myfit=lm(dheight~mheight,data=Heights)
> myfit
## 
## Call:
## lm(formula = dheight ~ mheight, data = Heights)
## 
## Coefficients:
## (Intercept)      mheight  
##     29.9174       0.5417
> summary(myfit) #details
## 
## Call:
## lm(formula = dheight ~ mheight, data = Heights)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -7.397 -1.529  0.036  1.492  9.053 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 29.91744    1.62247   18.44   <2e-16 ***
## mheight      0.54175    0.02596   20.87   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.266 on 1373 degrees of freedom
## Multiple R-squared:  0.2408, Adjusted R-squared:  0.2402 
## F-statistic: 435.5 on 1 and 1373 DF,  p-value: < 2.2e-16
> coefficients(myfit)
## (Intercept)     mheight 
##   29.917437    0.541747

通过二维散点图查看拟合效果:

> plot(Heights)
> abline(myfit,col="red")

> plot(myfit)

1.2 例2

alr4 包中的数据集 brains 给出了 62 种哺乳动物的脑重量 (g) 和体重数据 (kg), 我们希望了解脑重(响应) 与体重(自变量)的关系。

> plot(brains[,2:1])
> a=lm(BrainWt~BodyWt,data=brains)
> abline(a,col="red")#拟合效果不好

> plot(log(brains[,2:1]))
> b=lm(BrainWt~BodyWt,data=log(brains))
> abline(b,col="red")#拟合效果很好

查看体重在4kg附近的几种动物数据,显然第二个模型预测更准确一些。

> brains[brains[,2]<4.5 & brains[,2]>3.5,]
##                       BrainWt BodyWt
## Vervet                 57.998  4.190
## Yellow-bellied marmot  17.000  4.050
## Rock hyrax2            21.000  3.600
## Raccoon                39.201  4.288
## Red fox                50.400  4.235

2 练习

2.1 练习1

研究福布斯财富与排名的关系。 导入数据框:

> forbes <- read.table(file="forbes2019.txt",header=T,sep="\t",fileEncoding="UTF-8")
> names(forbes)=c("Rank","name_ch","name_en","Wealth","Wealth_source")

直接对Wealth与Rank做线性拟合,得到的效果不好。

> plot(forbes[,c("Rank","Wealth")])
> myfit1=lm(Wealth~Rank,data=forbes)
> abline(myfit1,col="red")#拟合效果不好

分析线性拟合情况:

> summary(myfit1)
## 
## Call:
## lm(formula = Wealth ~ Rank, data = forbes)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -13.652  -7.619  -1.689   4.265  76.411 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 55.13537    2.46305   22.39   <2e-16 ***
## Rank        -0.54635    0.04264  -12.81   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.21 on 98 degrees of freedom
## Multiple R-squared:  0.6262, Adjusted R-squared:  0.6224 
## F-statistic: 164.2 on 1 and 98 DF,  p-value: < 2.2e-16

对两个变量取对数:

> plot(log(forbes[,c("Rank","Wealth")]))
> myfit2=lm(Wealth~Rank,data=log(forbes[,c("Rank","Wealth")]))
> abline(myfit2,col="red")#拟合效果很好

> summary(myfit2)
## 
## Call:
## lm(formula = Wealth ~ Rank, data = log(forbes[, c("Rank", "Wealth")]))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.34423 -0.03443 -0.02314  0.00228  0.21810 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  5.219425   0.036374  143.49   <2e-16 ***
## Rank        -0.568528   0.009708  -58.56   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.08949 on 98 degrees of freedom
## Multiple R-squared:  0.9722, Adjusted R-squared:  0.9719 
## F-statistic:  3430 on 1 and 98 DF,  p-value: < 2.2e-16

可见,在使用对数拟合后,决定系数\(R^2\)明显上升,拟合效果提升很多。 所以,最终的模型为: \[ \log(Wealth)=5.219425-0.568528 \times \log(Rank). \] 也就是: \[ Wealth=e^{5.219425} \times Rank^{-0.568528}. \]

2.2 练习2

导入成年人身高-体重数据:

> wh <- read.table(file="height-weight.txt",header=T,sep="\t",fileEncoding="UTF-8")

查看前5行:

> head(wh,5)
##   sex weight height
## 1   1     77   1.82
## 2   0     58   1.61
## 3   0     53   1.61
## 4   1     68   1.77
## 5   0     59   1.57

(a)求LS估计 使用所有数据估计:

> myfit3=lm(weight~height,data=log(wh))
> summary(myfit3)
## 
## Call:
## lm(formula = weight ~ height, data = log(wh))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.27563 -0.07883 -0.00480  0.07677  0.45833 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.59878    0.08843   29.39   <2e-16 ***
## height       2.92966    0.16521   17.73   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1216 on 197 degrees of freedom
## Multiple R-squared:  0.6148, Adjusted R-squared:  0.6129 
## F-statistic: 314.4 on 1 and 197 DF,  p-value: < 2.2e-16

男性:

> wh_male=wh[wh["sex"]==1,]
> myfit_male=lm(weight~height,data=log(wh_male))
> summary(myfit_male)
## 
## Call:
## lm(formula = weight ~ height, data = log(wh_male))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.29310 -0.08979 -0.00310  0.08148  0.43410 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.9825     0.2186   13.64  < 2e-16 ***
## height        2.3181     0.3787    6.12 2.69e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1277 on 86 degrees of freedom
## Multiple R-squared:  0.3034, Adjusted R-squared:  0.2953 
## F-statistic: 37.46 on 1 and 86 DF,  p-value: 2.689e-08

女性:

> wh_female=wh[wh["sex"]==0,]
> myfit_female=lm(weight~height,data=log(wh_female))
> summary(myfit_female)
## 
## Call:
## lm(formula = weight ~ height, data = log(wh_female))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.28367 -0.05921 -0.01299  0.06503  0.31279 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   3.1203     0.1413   22.08  < 2e-16 ***
## height        1.8333     0.2829    6.48 2.74e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.103 on 109 degrees of freedom
## Multiple R-squared:  0.2781, Adjusted R-squared:  0.2715 
## F-statistic: 41.99 on 1 and 109 DF,  p-value: 2.743e-09

于是,所有数据: \[ \hat{a}=2.59878,\hat{b}=2.92966,\hat{\sigma}=0.1216. \] 男性: \[ \hat{a}=2.9825,\hat{b}=2.3181,\hat{\sigma}=0.1277. \] 女性: \[ \hat{a}=3.1203,\hat{b}=1.8333,\hat{\sigma}=0.103. \] 对于男性,公式(1)可以表示为: \[ \frac{W}{H^{2.3181}} > \exp\{2.9825+1.645\times 0.1277\}=24.35083. \] 代入我的身高体重\(H=1.738,W=69.1\), 左边=\(\frac{69.1}{1.738^{2.3181}}\)=19.18751<24.35083, 所以我没有超重。 \[ BMI=\frac{W}{H^2}=\frac{69.1}{1.738^2}=22.87592. \] 计算出的BMI在18.5-25这一正常范围内,故判别结果相同。

(b)假设检验b=2,则检验的p值为

> n=dim(wh)[1] # n=199
> wh_log=log(wh)
> height=wh_log["height"]
> s_xx=(n-1)*var(height)
> t=(2.92966-2)*sqrt(s_xx)/0.1216
> p=1-pt(t,n-2)
> print(p[1,1])
## [1] 3.090785e-08

显著,所以我们不能认为得到了BMI计算公式。

(c)控制性别

> wh_c=wh_log
> wh_c[wh_c["sex"]==0,]["sex"]<-1
> wh_c[wh_c["sex"]<0,]["sex"]<-0
> myfit_c=lm(weight~height+sex,data=wh_c)
> coefficients(myfit_c)
## (Intercept)      height         sex 
##   3.0087053   2.0571556   0.1240784

这时得到的\(\hat{b}=2.0571556\),能推断出\(b=2\).

2.3 练习3

先写一个计算首位数字的函数calfirstnum:

> calfirstnum<-function(x){
+   #x>0
+   if(x>1){
+     if(x<10){
+       return(floor(x))
+     }
+     else{
+       return(calfirstnum(x/10))
+     }
+   }
+   else{
+     return(calfirstnum(10*x))
+   }
+ }
> y=rnorm(10000)
> x=exp(y)
> count=numeric(9)
> for (i in 1:10000){
+   for(j in 1:9){
+     if (calfirstnum(x[i])==j) count[j]<-count[j]+1
+   }
+ }
> count_norm=count/10000
> a=seq(0,9,0.1)
> f=log10(1+1/a)
> plot(a,f,type="l", lwd=2,col="red",main="f=log10(1+1/x)")
> points(1:9,count_norm)

可以看出模拟数据对应的散点落在函数曲线上!所以论断成立。

如果把方差换成其他数,例如取\(\sigma=20\).发现结论依然成立。

> y=rnorm(10000,0,20)
> x=exp(y)
> count=numeric(9)
> for (i in 1:10000){
+   for(j in 1:9){
+     if (calfirstnum(x[i])==j) count[j]<-count[j]+1
+   }
+ }
> count_norm=count/10000
> a=seq(0,9,0.1)
> f=log10(1+1/a)
> plot(a,f,type="l", lwd=2,col="red",main="f=log10(1+1/x)")
> points(1:9,count_norm)