
Selected Reader Reviews of 《An Introduction to Statistical Learning》

《An Introduction to Statistical Learning》 is a Hardcover book by Gareth James, Daniela Witten, et al., published by Springer, priced at USD 79.99, 426 pages. Below is a carefully collected set of reader reviews, which we hope will be helpful.

《An Introduction to Statistical Learning》 Review (1): A genuinely conscientious book

A genuinely conscientious book, carefully built for struggling students... It explains deep ideas in plain terms, even showing you how to do the matrix arithmetic in case you don't know, and it avoids matrix notation and other purely mathematical expressions as much as possible. It suits readers who only want to learn the applications and don't need to worry much about the underlying proofs. The examples are plentiful and very practical, and the R examples are genuinely useful. All in all, it is very well suited for self-study.

《An Introduction to Statistical Learning》 Review (2): A brief review of ISL

1. An introductory book on statistical learning, accessible and easy to follow, billed as the entry-level version of ESL. There is not much mathematical derivation, so it suits readers from engineering backgrounds rather than statistics students. 2. Supervised learning takes up most of the book. To me, its best feature is that every model is discussed around the variance-bias trade-off, and it also gives very detailed guidance on overall model performance and on rule-of-thumb values for the parameters.

《An Introduction to Statistical Learning》 Review (3): The best introduction

Very well suited to beginners: there is almost no mathematics and the English is easy to read, and for unfamiliar vocabulary you can consult the Chinese edition, titled 统计学习导论：基于 R 应用. It is ideal for people who have just started with machine learning, and for novices like me: I have downloaded N machine learning books, and this is the only one I could actually keep reading. When starting out, the main thing is to understand the concepts and get a rough picture of machine learning; there is no need to wrestle with mathematical proofs and algorithmic details right away.

《An Introduction to Statistical Learning》 Review (4): Points I found illuminating

1. expected test MSE

use: to assess the accuracy of model predictions.

obtain: repeatedly estimate f using a large number of training sets, and evaluate each estimate at a test point $x_0$.

decompose: into 3 parts -- variance, bias, and irreducible error.

note: the meaning of variance and bias, and the trade-off between them (generally, more flexible methods give higher variance and lower bias); the decomposition is written out below.
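
For reference, the decomposition referred to above can be written out explicitly (the expected test MSE at a fixed test point $x_0$, as discussed in Chapter 2 of the text, with $\varepsilon$ the irreducible noise):

$$E\Big[\big(y_0-\hat{f}(x_0)\big)^2\Big]=\mathrm{Var}\big(\hat{f}(x_0)\big)+\Big[\mathrm{Bias}\big(\hat{f}(x_0)\big)\Big]^2+\mathrm{Var}(\varepsilon)$$

The first two terms depend on the learning method and move in opposite directions as flexibility changes, while $\mathrm{Var}(\varepsilon)$ is a floor that no method can go below.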

2. Understanding the standard error of the mean (SEM)

1) Why introduce the concept of the SEM?

Because, on average, the sample mean is an unbiased estimator of the population mean. For any single sample, however, using its mean as an estimate of the population mean involves some error, and the typical size of that error is what the SEM measures.

2) A population has its own distribution, and therefore its own mean and variance. Since the population mean cannot be observed, the sample mean is used to estimate it (one instance of using a sample statistic to estimate a population parameter). If we sample repeatedly and record the resulting sample means, these sample means form a new distribution, called the sampling distribution of the sample mean, which again has its own mean and variance.

- The mean of this new distribution equals the population mean (unbiasedness).

- The variance of this new distribution equals the population variance divided by the sample size n; the SEM is its square root, $\sigma/\sqrt{n}$.

Note: the SEM itself is not a deep concept, but it helps in distinguishing the population distribution, the sample's distribution, and the distribution of a sample statistic. For example, if asked what the standard error of the estimated population mean is (and how to obtain it), and how it differs from and relates to the population standard deviation/variance, the reasoning above is the way to answer. A small simulation is sketched below.
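
A minimal NumPy simulation of the argument above (the population parameters, sample size, and number of replications are arbitrary choices for illustration): draw many samples, record each sample mean, and compare the spread of those means with the analytic value $\sigma/\sqrt{n}$.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n, reps = 5.0, 2.0, 50, 10_000  # population mean/sd, sample size, replications

# Repeatedly draw samples of size n and record each sample mean.
sample_means = rng.normal(mu, sigma, size=(reps, n)).mean(axis=1)

print("mean of sample means:", sample_means.mean())        # ~ mu (unbiasedness)
print("sd of sample means:  ", sample_means.std(ddof=1))   # ~ sigma / sqrt(n)
print("analytic SEM:        ", sigma / np.sqrt(n))          # 2 / sqrt(50) ≈ 0.283
```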

3. Explaining the apparent contradiction between simple and multiple linear regression results

A simple linear regression shows that sales and newspaper are significantly related, yet in the multiple linear regression of sales on newspaper, TV, and radio, the relationship between sales and newspaper is no longer significant. How do we explain this contradiction?

The key is that the simple regression coefficient is estimated ignoring the other predictors, while the multiple regression coefficient is estimated holding the other predictors fixed.

Inspecting the correlation matrix shows that newspaper and radio are strongly correlated. Most likely, newspaper itself has little effect on sales, but higher newspaper spending tends to come with higher radio spending, and radio does have a significant effect on sales. The simple regression therefore shows a spurious relationship: newspaper is merely a proxy for radio.

A classic example is drowning rate -- weather -- ice cream sales: ice cream does not cause the drowning rate to rise; rather, high ice cream sales tend to coincide with hot weather, and hot weather drives up the drowning rate.

This contradiction indirectly shows how important the choice of control variables is; a small simulation of the proxy effect is sketched below.
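
A minimal sketch of the proxy-variable effect on simulated data (the names newspaper/radio/sales mirror the Advertising example in the text, but all numbers here are invented): sales is driven by radio only, yet newspaper looks significant on its own because it tracks radio.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 200
radio = rng.normal(30, 10, n)
newspaper = 0.8 * radio + rng.normal(0, 5, n)      # newspaper tracks radio (the proxy)
sales = 5 + 0.5 * radio + rng.normal(0, 2, n)      # sales truly depends on radio only

# Simple regression: sales ~ newspaper looks significant (spurious).
simple = stats.linregress(newspaper, sales)
print("simple slope:", round(simple.slope, 3), " p-value:", simple.pvalue)

# Multiple regression: sales ~ newspaper + radio; newspaper's coefficient collapses.
X = np.column_stack([np.ones(n), newspaper, radio])
beta, *_ = np.linalg.lstsq(X, sales, rcond=None)
print("multiple coefficients (intercept, newspaper, radio):", beta.round(3))
```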

4. Why we still need the F-test when we already have t-tests

When you study statistics the answer seems obvious: the t-test is for a single coefficient, while the F-test is for the regression equation as a whole (multiple coefficients). But that answer only restates what we're taught to believe. Pushing further: the alternative hypothesis of the F-test is that at least one coefficient is nonzero, so why not just look at the individual t-statistics or p-values -- as soon as one is significant, can't we reject the F-test's H0?

That logic is flawed, and the problem shows up when the number of predictors p is large. Suppose the F-test's H0 is true and every coefficient equals 0. At the 5% significance level, each p-value falls below 0.05 purely by chance with probability 5%, so with 100 predictors about 5 will appear significant on average, and the probability that at least one appears significant is nearly 100% -- that is the problem. The F-statistic, by contrast, adjusts for the number of predictors: no matter how many there are, it falls below the 0.05 threshold only 5% of the time when H0 is true. A quick simulation is sketched below.
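
A quick simulation of the "at least one lucky p-value" problem (sizes chosen arbitrarily): y is pure noise, unrelated to any of the 100 predictors, yet typically a handful of individual t-tests come out significant, while the F-test for the full regression accounts for the multiplicity.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n, p = 300, 100
X = rng.normal(size=(n, p))
y = rng.normal(size=n)                      # H0 is true: y is unrelated to every predictor

# Individual simple regressions: count p-values < 0.05 by luck alone.
pvals = np.array([stats.linregress(X[:, j], y).pvalue for j in range(p)])
print("individually 'significant' predictors:", (pvals < 0.05).sum())   # ~5 on average

# Overall F-test for the multiple regression y ~ X.
Xd = np.column_stack([np.ones(n), X])
beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
rss = np.sum((y - Xd @ beta) ** 2)
tss = np.sum((y - y.mean()) ** 2)
F = ((tss - rss) / p) / (rss / (n - p - 1))
print("F statistic:", round(F, 2), " p-value:", round(stats.f.sf(F, p, n - p - 1), 3))
```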

5. Explaining the apparent contradiction between simple and multiple logistic regression results

In the multiple logistic regression of default on balance, income, and student[Yes], the coefficient of the dummy variable student[Yes] is negative, yet in the simple logistic regression of default on student[Yes] alone it is positive. How do we explain this?

As in point 3, the key is that the simple regression ignores the other predictors while the multiple regression holds them fixed. At the same balance, a non-student is more likely to default than a student, but because students typically carry higher balances, the average default rate of students is higher than that of non-students. A small simulation is sketched below.
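
A simulated sketch of the sign flip (the student/balance/default names mirror the Default data in the text; the data-generating numbers are invented so that the confounding is visible): the true student effect at fixed balance is negative, but students carry higher balances, so the marginal coefficient comes out positive.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
n = 20_000
student = rng.binomial(1, 0.3, n)
balance = rng.normal(0.8 + 0.6 * student, 0.3, n)            # balance in $1000s; students carry more
p_default = 1 / (1 + np.exp(-(-10 + 6 * balance - 0.8 * student)))  # true student effect: negative
default = rng.binomial(1, p_default)

simple = LogisticRegression(max_iter=1000).fit(student.reshape(-1, 1), default)
multiple = LogisticRegression(max_iter=1000).fit(np.column_stack([balance, student]), default)

print("simple   student coefficient:", round(simple.coef_[0, 0], 2))    # > 0
print("multiple student coefficient:", round(multiple.coef_[0, 1], 2))  # < 0
```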

6. The overall error rate is not always what matters: with imbalanced classes (as in the Default data), a classifier can achieve a low overall error rate while misclassifying most of the rare class, so class-specific error rates (sensitivity/specificity) and the choice of threshold deserve attention.

《An Introduction to Statistical Learning》 Review (5): Notes

Notes on An Introduction to Statistical Learning

=====================================

## Statistical Learning

- basic concepts

- two main reasons to estimate f: prediction and inference

- trade-off: complex models may give accurate predictions, but they may also be hard to interpret

- reducible vs. irreducible error

- how to estimate f: parametric vs. non-parametric approach

- pros of non-parametric

- flexibility, could possibly fit a wider range of f

- cons

- it does not reduce the problem to estimating a small set of parameters; it can be complicated and needs a large dataset to get a good result

- assessing model accuracy

- training/test error and complexity/flexibility of the model

- note: error could be **MSE** for regression and **error rate** for classification

- bias-variance trade-off

- complex models typically have a higher variance and lower bias, see notes for deep learning for derivation

## Linear Regression

- simple linear regression

- estimate coefficients

- note that the estimate is **unbiased**, meaning that the average of a large number of estimates $\hat{\mu}$ will be very close to the true $\mu$.

- confidence intervals and hypothesis tests for a linear relationship using the **t-statistic** and p-value (a minimal sketch follows)
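
A minimal sketch of these quantities with statsmodels on simulated data (the true intercept, slope, and noise level are arbitrary): the fitted model reports coefficient estimates together with their t-statistics, p-values, and 95% confidence intervals.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
x = rng.uniform(0, 10, 100)
y = 1.0 + 2.0 * x + rng.normal(0, 2, 100)   # true intercept 1, true slope 2

model = sm.OLS(y, sm.add_constant(x)).fit()
print(model.params)       # estimated intercept and slope
print(model.tvalues)      # t-statistics for H0: coefficient = 0
print(model.pvalues)      # corresponding p-values
print(model.conf_int())   # 95% confidence intervals
```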

- multiple linear regression

- important questions

- whether there exists a relationship between response and predictors

- use the F-statistic over all predictors instead of looking at the p-value of every single predictor; the reason is that when the number of predictors is large, it is very likely that at least one p-value is small purely by chance (which would wrongly suggest some linear relationship), whereas the F-statistic takes all predictors into account at once

- deciding on important variables (variable selection)

- methods

- try all possible models with different combinations of variables

- forward/backward/mixed selection

- model fit

- prediction

- extensions of linear model

- linear model is based on two assumptions: additive and linear

- remove additive assumption: introduce interaction term

- nonlinear relationships: polynomial regression, kernel trick, etc.
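
A small sketch of both extensions using the statsmodels formula interface (variable names and coefficients are invented): `x1:x2` adds an interaction term that relaxes the additive assumption, and `I(x1**2)` adds a quadratic term that relaxes linearity.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(5)
n = 300
df = pd.DataFrame({"x1": rng.normal(size=n), "x2": rng.normal(size=n)})
df["y"] = (1 + 2 * df.x1 + 0.5 * df.x2 + 1.5 * df.x1 * df.x2
           + 0.8 * df.x1**2 + rng.normal(0, 1, n))

# Interaction term plus a polynomial term on top of the plain linear model.
fit = smf.ols("y ~ x1 + x2 + x1:x2 + I(x1**2)", data=df).fit()
print(fit.params)
```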

- potential problems

- nonlinearity issue

- can be identified by **residual plot**

- correlation of error terms

- typically happens for time series data, need to check autocorrelation

- non-constant variance of error terms

- outliers

- high leverage points

- collinearity

## Classification

- why not linear regression?

- hard to convert multiple (more than 2) class labels to quantitative numerical values

- even for binary responses it does not make sense to use linear regression, since the output may fall outside the [0,1] interval and cannot be read as a probability

- logistic regression

- determine coefficients

- use maximum likelihood method
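
Concretely, writing $p(x)=\Pr(Y=1\mid X=x)$ for the logistic model with coefficients $\beta_0,\beta_1$ (the one-predictor case), the coefficients are chosen to maximize the likelihood

$$\ell(\beta_0,\beta_1)=\prod_{i:\,y_i=1}p(x_i)\prod_{i':\,y_{i'}=0}\big(1-p(x_{i'})\big)$$

In practice one maximizes its logarithm numerically, since there is no closed-form solution.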

- confounding

- the results obtained from using one predictor may be quite different from those obtained using multiple predictors, **especially when there is correlation among the predictors** (see the student-balance example in the text)

- linear discriminant analysis

- basic idea: classify data points using a Bayesian approach, that is, find the $k$ that maximizes $$p_k(x)=\frac{\pi_k f_k(x)}{\sum_l \pi_l f_l(x)}$$ where $\pi_k$ is the prior and $f_k(x)$ is the likelihood (class density). We further assume the $f_k$ are Gaussian, sharing the same $\Sigma$ but with different $\mu_k$. The $\mu_k$, $\Sigma$ and $\pi_k$ can be estimated from the training set, and the decision boundary is then determined by the Bayes rule via

$$\delta_k(x) = -\frac{1}{2}\log|\Sigma_k| - \frac{1}{2}(x-\mu_k)^T\Sigma_k^{-1}(x-\mu_k) + \log \pi_k$$

Since the $\Sigma_k$ are equal, this reduces to

$$\delta_k(x)=x^T \Sigma^{-1}\mu_k-\frac{1}{2}\mu_k^T \Sigma^{-1}\mu_k + \log \pi_k + \text{const}$$

This gives a **linear decision boundary**. If the $\Sigma_k$ differ, the **boundary may be quadratic**

- quadratic discriminant analysis

- it assumes that classes have different covariance matrices.

- LDA vs. QDA

- this is a typical bias-variance trade-off. Since QDA has more parameters, it is better when there are many training observations or when it is clear that the classes have different covariance matrices. In contrast, LDA works well when the number of observations is small.

- comparison of classification methods

- when the true boundary is linear, LDA or logistic regression tends to do better; when it is nonlinear, QDA or KNN may be better, but the level of smoothness of KNN must be chosen carefully (a small LDA-vs-QDA comparison is sketched below)
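
A minimal comparison of the two discriminant methods in scikit-learn, on simulated data where the two classes genuinely have different covariance matrices (all numbers arbitrary); QDA should have the edge in this setting.

```python
import numpy as np
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                            QuadraticDiscriminantAnalysis)
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(6)
n = 2000
X0 = rng.multivariate_normal([0, 0], [[1.0, 0.6], [0.6, 1.0]], n)      # class 0 covariance
X1 = rng.multivariate_normal([1, 1], [[2.0, -0.8], [-0.8, 0.5]], n)    # class 1: different covariance
X = np.vstack([X0, X1])
y = np.array([0] * n + [1] * n)

Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.5, random_state=0)
for name, clf in [("LDA", LinearDiscriminantAnalysis()),
                  ("QDA", QuadraticDiscriminantAnalysis())]:
    print(name, "test accuracy:", clf.fit(Xtr, ytr).score(Xte, yte))
```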

## Resampling methods

- cross validation

- leave-one-out cross-validation

- k-fold cross-validation

- bootstrap
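
A minimal sketch of these resampling ideas with scikit-learn (the linear model and simulated data are placeholders): LOOCV is just k-fold CV with k = n, and a bootstrap sample is drawn with `sklearn.utils.resample`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.utils import resample

rng = np.random.default_rng(7)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(0, 1, 100)
model = LinearRegression()

loo_mse = -cross_val_score(model, X, y, cv=LeaveOneOut(),
                           scoring="neg_mean_squared_error").mean()
k10_mse = -cross_val_score(model, X, y, cv=10,
                           scoring="neg_mean_squared_error").mean()
print("LOOCV MSE:", loo_mse, " 10-fold CV MSE:", k10_mse)

# Bootstrap estimate of the variability of the first coefficient.
boot = [LinearRegression().fit(*resample(X, y, random_state=b)).coef_[0]
        for b in range(1000)]
print("bootstrap SE of first coefficient:", np.std(boot, ddof=1))
```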

## Linear model selection and regularization

- subset selection

- best subset selection

- basic idea

- fix number of predictors k, find the best model $M_k$ for each k

- may use $R^2$, MSE, etc

- for all $M_k$, select the best one

- should use AIC, BIC, adjusted $R^2$, etc., because as the number of predictors increases the training MSE always decreases and $R^2$ always increases; AIC/BIC/adjusted $R^2$ include a penalty on the number of predictors

- definition of AIC/BIC/adjusted $R^2$

- AIC = $\frac{1}{n\hat{\sigma}^2}(RSS+2d\hat{\sigma}^2)$

- BIC = $\frac{1}{n}(RSS+\log(n)\,d\hat{\sigma}^2)$

- adjusted $R^2 = 1- \frac{RSS/(n-d-1)}{TSS/(n-1)}$

- note that in addition to using these metrics above related to training set directly, we may also use **cross validation** (estimate test error directly) to select the optimal model, which requires fewer assumptions about the model and is more general

- problem

- it has to enumerate all $2^p$ possible models, which is computationally prohibitive

- stepwise selection

- forward stepwise selection: also produces a best model $M_k$ for each k, but uses a greedy approach that adds predictors one at a time; the best among the $M_k$ is then selected (a hand-rolled sketch follows after this list)

- backward stepwise selection

- hybrid approaches
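
A hand-rolled sketch of forward stepwise selection on simulated data (only the first three of ten predictors actually matter here): at each step, add the predictor that most reduces the training RSS, producing one candidate $M_k$ per size k; the final choice among the $M_k$ would then be made with cross-validation or AIC/BIC rather than training RSS.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(8)
n, p = 200, 10
X = rng.normal(size=(n, p))
y = 3 * X[:, 0] - 2 * X[:, 1] + 1.5 * X[:, 2] + rng.normal(0, 1, n)  # 3 real predictors

def rss(features):
    pred = LinearRegression().fit(X[:, features], y).predict(X[:, features])
    return np.sum((y - pred) ** 2)

selected, remaining = [], list(range(p))
for k in range(p):
    # Greedy step: add the single predictor that gives the lowest training RSS.
    best = min(remaining, key=lambda j: rss(selected + [j]))
    selected.append(best)
    remaining.remove(best)
    print(f"M_{k + 1}: predictors {selected}, RSS = {rss(selected):.1f}")
```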

- shrinkage methods

- ridge regression

- basic idea: apply an L2-norm regularization term

- since regularization term is affected by scaling of the training data, it is best to apply ridge regression after standardization of predictors

- pros and cons

- pros: computationally very convenient; it does not require fitting many different models

- cons: the model still depends on all predictors, instead of depending on a subset of predictors. Lasso method can solve this issue.

- lasso

- basic idea: apply an L1-norm regularization term

- this method forces some of the coefficients to **be exactly zero**, which performs variable selection (a diagram explaining this is Fig. 6.7 in the text)

- another formulation for ridge and lasso, which reveals a close connection between lasso, ridge and best subset selection, see text

- comparing ridge and lasso (see text for details)

- ridge works better for models related to many predictors with coefficients of roughly equal size, while lasso works better for models where only a relatively small number of variables are important

- effects of two methods

- roughly speaking, ridge shrinks every dimension of the data **by the same proportion**, while lasso shrinks each coefficient towards 0 **by a similar amount**, so sufficiently small coefficients are driven exactly to zero (a small comparison is sketched at the end of this block)

- Bayesian interpretation of ridge and lasso

- the Bayesian formulation posits a prior distribution $p(\beta)$; the posterior of $\beta$ given $X,Y$ is $p(\beta|X,Y)\propto f(Y|X,\beta)p(\beta)$, and training amounts to finding the $\beta$ that maximizes $p(\beta|X,Y)$. When the prior is Gaussian with zero mean we recover ridge; when the prior is double-exponential (Laplace), we recover lasso.
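
A minimal scikit-learn comparison of ridge and lasso on simulated data with only a few truly nonzero coefficients (penalty strengths are arbitrary, and standardization is skipped only because the simulated predictors already share the same scale): lasso zeroes out most coefficients, while ridge keeps all of them, just shrunk.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(9)
n, p = 200, 20
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:3] = [4.0, -3.0, 2.0]                     # only the first 3 predictors matter
y = X @ beta + rng.normal(0, 1, n)

ridge = Ridge(alpha=10.0).fit(X, y)
lasso = Lasso(alpha=0.2).fit(X, y)
print("ridge: nonzero coefficients =", np.sum(ridge.coef_ != 0))   # all 20, shrunk
print("lasso: nonzero coefficients =", np.sum(lasso.coef_ != 0))   # roughly the true 3
print("lasso coefficients:", lasso.coef_.round(2))
```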

- dimension reduction methods

- previous methods use **original predictors**, dimension reduction methods use **transformed predictors**.

- methods

- PCA

- note that standardization is typically needed

- partial least squares (PLS)

- motivation: in PCA the response Y is not used; only X enters the training, in an **unsupervised** way. The principal directions extracted by PCA explain the predictors X well, but may not be well correlated with Y.

- basic idea: to find the first PLS direction, set the weight of each predictor $X_j$ to the coefficient of the simple linear regression of $Y$ onto $X_j$. To find subsequent PLS directions, take the **orthogonalized data** (residuals) and repeat the process iteratively (a PCR-vs-PLS sketch follows at the end of this section).

- considerations in high dimensions
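
A small PCR-vs-PLS sketch in scikit-learn (component count and data sizes are arbitrary): PCR is assembled from a PCA step followed by least squares, while PLS uses the response when building its directions. Because the simulated predictors here have no dominant variance directions, PCR's unsupervised components tend to miss the signal, which is exactly the motivation given above for PLS.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(10)
n, p = 200, 30
X = rng.normal(size=(n, p))
y = X[:, :5] @ rng.normal(size=5) + rng.normal(0, 1, n)   # y depends on 5 of 30 predictors

pcr = make_pipeline(StandardScaler(), PCA(n_components=5), LinearRegression())
pls = make_pipeline(StandardScaler(), PLSRegression(n_components=5))

for name, model in [("PCR", pcr), ("PLS", pls)]:
    mse = -cross_val_score(model, X, y, cv=5, scoring="neg_mean_squared_error").mean()
    print(name, "5-fold CV MSE:", round(mse, 2))
```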

## Moving beyond linearity

- polynomial regression

- step functions

- basis functions

- regression splines

- piecewise polynomials

- constraints and splines

- the spline basis representation

- choosing the number and locations of the knots

- see "notes of numerical analysis course" for more information on this part

- comparison to polynomial regression

- regression splines are typically better and more stable than polynomial regression, because they keep the polynomial degree low

- smoothing splines

- definition: minimize $$\sum_{i=1}^n (y_i-g(x_i))^2+\lambda \int g''(t)^2\,dt$$

which penalizes the roughness of the function

- property: it can be shown that the $g(x)$ minimizing this criterion is a natural cubic spline (a piecewise cubic polynomial) with knots at the unique values of $\{x_i\}$ (a sketch follows)
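
A small sketch using SciPy's UnivariateSpline, which fits a closely related smoothing spline: instead of penalizing $\int g''^2$ with $\lambda$ directly, its smoothing factor `s` bounds the residual sum of squares, so a larger `s` yields a smoother fit (all data here are simulated).

```python
import numpy as np
from scipy.interpolate import UnivariateSpline

rng = np.random.default_rng(11)
x = np.sort(rng.uniform(0, 10, 200))
y = np.sin(x) + rng.normal(0, 0.3, 200)

rough = UnivariateSpline(x, y, s=0)      # s=0 interpolates every point: very wiggly
smooth = UnivariateSpline(x, y, s=20)    # larger residual budget -> smoother fit

grid = np.linspace(0, 10, 5)
print("rough fit: ", rough(grid).round(2))
print("smooth fit:", smooth(grid).round(2))
print("true curve:", np.sin(grid).round(2))
```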

- local regression

- basic idea

- for each $x_0$, find a neighboring data set, and fit a model with weighted errors (the weight assigned to each training point is determined by its distance to $x_0$); then use this local model to predict $y$ at $x_0$

- generalized additive models

- basic idea

- apply a nonlinear transformation to each feature and then do linear fitting on transformed features

## Tree-based models

- basics of decision trees

- regression decision trees

- basic idea

- given predictors $X_1, ..., X_p$, find the optimal splitting predictor $j$ and optimal cutpoint $s$ such that, for the two regions $R_1(j,s)=\{X \mid X_j < s\}$ and $R_2(j,s)=\{X \mid X_j \geq s\}$,

$$\sum_{k=1,2}\sum_{i:\,x_i\in R_k}(y_i-\hat{y}_{R_k})^2$$

is minimized. Then pick one of the resulting regions and repeat the process until a stopping criterion is reached (e.g. each leaf node contains fewer than some number of observations). This greedy approach is called **recursive binary splitting**

- pruning

- a complex tree may lead to overfitting, pruning the tree may solve this issue

- cost complexity pruning: try to minimize error function

$$\sum_{m=1}^{|T|} \sum_{i:\,x_i \in R_m} (y_i-\hat{y}_{R_m})^2+\alpha |T|$$
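
In scikit-learn the same idea is exposed through the `ccp_alpha` parameter of decision trees; a minimal sketch on simulated data (all sizes arbitrary) walks the pruning path and shows the tree shrinking as $\alpha$ grows.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(12)
X = rng.uniform(-3, 3, size=(400, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2 + rng.normal(0, 0.3, 400)
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)

# Candidate alpha values along the cost-complexity pruning path of the full tree.
path = DecisionTreeRegressor(random_state=0).cost_complexity_pruning_path(Xtr, ytr)
for alpha in path.ccp_alphas[:: max(1, len(path.ccp_alphas) // 5)]:
    tree = DecisionTreeRegressor(random_state=0, ccp_alpha=alpha).fit(Xtr, ytr)
    print(f"alpha={alpha:.4f}  leaves={tree.get_n_leaves()}  "
          f"test R^2={tree.score(Xte, yte):.3f}")
```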

- classification decision trees

- basic idea

- most part is the same, except that error function is Gini index or cross entropy

- pros and cons

- pros

- interpretability

- easy to handle both quantitative and qualitative predictors

- cons

- may not have good predictive accuracy (can be improved by bagging, boosting, etc)

- bagging, random forests, boosting

- bagging

- motivation: reduce prediction variance

- method: train several unpruned trees (which have high variance but low bias) on different training sets, and make predictions by averaging the outputs / majority vote

- how are different training sets generated: bootstrap

- error estimation

- in addition to cross-validation, we may use **out-of-bag** observations to estimate error. The basic idea is that when using bootstrap approach, each bagged tree contains about $(1-1/e)$ of all data, so the remaining $1/e$ of data (called "out-of-bag" data) can be used to estimate test error.

- variable importance measures

- in a bagging model we lose the easy interpretability of an individual decision tree; variable importance measures serve as a metric for the relative importance of each feature

- random forests

- basic idea: similar to bagging, except that at each split we select a random subset of predictors and require that the split variable be chosen only from this subset. This **reduces the correlation between the trees** and makes the prediction more reliable, especially when one predictor is much stronger than the others.

- boosting

- basic idea: instead of building many trees in parallel, boosting builds trees sequentially: each tree fits the residuals that have not been explained by the previous trees. The shrinkage parameter $\lambda$ controls how fast each new tree learns the residuals (a small comparison of these ensembles is sketched below).
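
A small side-by-side of the three ensemble ideas in scikit-learn on simulated data (all settings arbitrary); `max_features` in the random forest is the per-split predictor subset, and `learning_rate` in boosting plays the role of the shrinkage parameter $\lambda$ above.

```python
import numpy as np
from sklearn.ensemble import (BaggingRegressor, GradientBoostingRegressor,
                              RandomForestRegressor)
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(13)
X = rng.normal(size=(500, 8))
y = np.sin(X[:, 0]) + X[:, 1] * X[:, 2] + 0.5 * X[:, 3] + rng.normal(0, 0.3, 500)

models = {
    "single tree": DecisionTreeRegressor(random_state=0),
    "bagging": BaggingRegressor(n_estimators=200, random_state=0),
    "random forest": RandomForestRegressor(n_estimators=200, max_features="sqrt",
                                           random_state=0),
    "boosting": GradientBoostingRegressor(n_estimators=500, learning_rate=0.05,
                                          max_depth=2, random_state=0),
}
for name, model in models.items():
    score = cross_val_score(model, X, y, cv=5, scoring="r2").mean()
    print(f"{name:13s} 5-fold CV R^2 = {score:.3f}")
```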

## Support vector machines

- see notes for course "advanced data science"

## Unsupervised learning

- challenges of unsupervised learning

- PCA

- clustering

- K-means

- hierarchical clustering

- pros: does not require pre-specifying the number of clusters; a single dendrogram can be cut to obtain any number of clusters (a small sketch follows)
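
A minimal sketch of both clustering approaches on three artificial blobs (all numbers arbitrary): K-means needs the number of clusters up front, while the hierarchical merge tree can be cut afterwards at any level.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=150, centers=3, cluster_std=1.0, random_state=0)

# K-means: the number of clusters must be chosen in advance.
km_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print("K-means cluster sizes:", np.bincount(km_labels))

# Hierarchical clustering: build the merge tree once, then cut it at any level.
Z = linkage(X, method="complete")          # complete-linkage merge tree
for k in (2, 3, 4):
    labels = fcluster(Z, t=k, criterion="maxclust")
    print(f"cut into {k} clusters, sizes:", np.bincount(labels)[1:])
# scipy.cluster.hierarchy.dendrogram(Z) would draw the full tree with matplotlib.
```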
