#stats #books

Examples of supervised learning problems:
- Spam: some words are more frequent in emails classified as spam
- PSA as a function of prostate weight, Gleason score, etc.: scatterplots, but the relationship is not obvious (1989 data)
- Handwritten digit recognition on normalized 16x16 images

DNA microarray: for a list of genes, measure the amount of RNA that hybridizes in a sample versus a reference (by fluorescence, one is red, the other green). With several samples this gives a matrix. Which genes are predictive -> unsupervised learning

** Overview of supervised learning
:PROPERTIES:
:CUSTOM_ID: overview-of-supervised-learning
:END:
K-nearest neighbors: no assumption on the distribution ("low bias") but large variability ("high variance").\\
Least squares: assumes the linear model is appropriate ("high bias") but is stable ("low variance")

* Chap 2
:PROPERTIES:
:CUSTOM_ID: chap-2
:END:

** Linear vs nearest neighbours
:PROPERTIES:
:CUSTOM_ID: linear-vs-nearest-neighbours
:END:
- linear model fitted by least squares = stable but possibly inaccurate
- nearest neighbours = accurate but unstable

If the data is a set of tightly clustered Gaussians (their means themselves drawn from a Gaussian), the optimal decision boundary is nonlinear and disjoint.
If each class comes from a single Gaussian with uncorrelated components, a linear boundary is almost optimal.

** Bias, variance
:PROPERTIES:
:CUSTOM_ID: bias-variance
:END:

*** Bias
:PROPERTIES:
:CUSTOM_ID: bias
:END:
\[Bias_\theta(\hat{\theta}) = E_{x|\theta}(\hat{\theta}) - \theta\]

where \(E_{x|\theta}\) is the expectation over \(x \mid \theta\) (average over all possible observations \(x\) with \(\theta\) fixed).

Error from erroneous assumptions in the learning algorithm. High bias can cause an algorithm to miss the relevant relations between features and target outputs (underfitting).

*** Variance
:PROPERTIES:
:CUSTOM_ID: variance
:END:
Measures the dispersion:
\[Variance(X) = E[(X - E(X))^2]\]

Error from sensitivity to small fluctuations in the training set. High variance may result from an algorithm modeling the random noise in the training data (overfitting).

** Curse of dimensionality = sparse sampling in high dimensions -> it's harder to have enough training data
:PROPERTIES:
:CUSTOM_ID: curse-of-dimensionality-sparse-sampling-in-hihg-dimensions---its-harder-to-have-enough-training-data
:END:

#+begin_quote
we saw that squared error loss lead us to the regression function f(x) = E(Y|X = x) for a quantitative response. The class of nearest-neighbor methods can be viewed as direct estimates of this conditional expectation, but we have seen that they can fail in at least two ways:
• if the dimension of the input space is high, the nearest neighbors need not be close to the target point, and can result in large errors;
• if special structure is known to exist, this can be used to reduce both the bias and the variance of the estimates.
#+end_quote

Other models are designed to overcome the dimensionality problem.
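A minimal numerical sketch of the sparse-sampling problem above (my own toy setup, not the book's: points uniform in the unit hypercube, query at a corner, 10% capture fraction):

#+begin_src python
import numpy as np

rng = np.random.default_rng(0)
N = 1000                                   # training points per dimension setting
for p in (1, 2, 10, 100):
    pts = rng.uniform(0.0, 1.0, size=(N, p))
    # distance from the origin (a corner of the hypercube) to its nearest neighbour
    d_nn = np.min(np.linalg.norm(pts, axis=1))
    # edge length of the sub-cube needed to capture 10% of uniformly spread data
    edge = 0.1 ** (1.0 / p)
    print(f"p={p:>3}  nearest neighbour at distance {d_nn:.3f}  "
          f"edge to capture 10% of data {edge:.3f}")
#+end_src

As p grows, the "nearest" neighbour drifts far from the query point and a neighbourhood covering a fixed fraction of the data spans almost the whole range of every input, which is exactly the sparse-sampling problem named in the heading.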
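And a hedged sketch of the bias/variance contrast between least squares and k-nearest neighbours from the sections above. The setup (sine target, noise level 0.3, k = 5, test point x0 = 0.3) is a made-up toy example, not the book's simulation:

#+begin_src python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(2 * np.pi * x)     # true (nonlinear) regression function
x0, n, k, n_sims = 0.3, 50, 5, 500      # test point, training size, neighbours, repetitions

preds_ls, preds_knn = [], []
for _ in range(n_sims):
    x = rng.uniform(0.0, 1.0, n)
    y = f(x) + rng.normal(0.0, 0.3, n)
    # least squares on (1, x): stable (low variance) but biased because f is nonlinear
    X = np.column_stack([np.ones(n), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    preds_ls.append(beta[0] + beta[1] * x0)
    # k-NN: average the k responses nearest to x0 -- low bias, high variance
    nearest = np.argsort(np.abs(x - x0))[:k]
    preds_knn.append(y[nearest].mean())

for name, p in (("least squares", np.array(preds_ls)), ("k-NN (k=5)", np.array(preds_knn))):
    # Bias = E[f_hat(x0)] - f(x0),  Variance = E[(f_hat(x0) - E[f_hat(x0)])^2]
    print(f"{name:>14}: bias = {p.mean() - f(x0):+.3f}   variance = {p.var():.4f}")
#+end_src

Over repeated training samples the linear fit shows a large bias at x0 with small variance, while k-NN shows the opposite, matching the low-bias/high-variance vs high-bias/low-variance description in the notes.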