​#stats #books

Examples of supervised problems

- Spam: some words are more frequent in emails classified as spam

- PSA as a function of prostate weight, Gleason score, etc.: scatterplots,
  but the relationship is not obvious (1989 data)

- Digit recognition on normalized 16x16 images

DNA microarray: for a list of genes, we measure how much RNA hybridizes
in a sample vs. a reference (by fluorescence, one is red, the other
green). With several samples this gives a matrix. Which genes are
predictive -> unsupervised learning

** Overview of supervised learning
:PROPERTIES:
:CUSTOM_ID: overview-of-supervised-learning
:END:

K-nearest neighbors: no assumption about the distribution ("low bias")
but large variability ("high variance").\\
Least squares: assumes a linear model is appropriate ("high bias") but
is stable ("low variance")

* Chap 2
:PROPERTIES:
:CUSTOM_ID: chap-2
:END:
** Linear vs nearest neighbours
:PROPERTIES:
:CUSTOM_ID: linear-vs-nearest-neighbours
:END:
- linear model fitted by least squares = stable but possibly inaccurate
- nearest neighbours = precise but unstable

If the data in each class comes from a mixture of tightly clustered
Gaussians (whose means are themselves Gaussian-distributed), the optimal
decision boundary is nonlinear and disjoint. If each class is a single
Gaussian with uncorrelated components, a linear boundary is almost optimal.
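
A minimal sketch of the two scenarios (assuming scikit-learn and made-up
parameters, loosely following the chapter-2 simulation): generate
two-class data either from one Gaussian per class or from a mixture of
tightly clustered Gaussians, and compare a linear classifier (logistic
regression here, standing in for the book's linear regression on a 0/1
indicator) with k-NN.

#+begin_src python
# Sketch (assumed parameters): linear vs k-nearest-neighbour classification
# under (a) one Gaussian per class and (b) a mixture of tight clusters whose
# means are themselves Gaussian-distributed.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

def sampler_single():
    # scenario (a): one Gaussian per class
    def sample(n):
        X = np.vstack([rng.normal([0, 0], 1.0, (n, 2)),
                       rng.normal([2, 2], 1.0, (n, 2))])
        return X, np.repeat([0, 1], n)
    return sample

def sampler_mixture():
    # scenario (b): 10 cluster means per class (drawn once), tight clusters around them
    c0 = rng.normal([0, 0], 1.0, (10, 2))
    c1 = rng.normal([2, 2], 1.0, (10, 2))
    def sample(n):
        X = np.vstack([c0[rng.integers(10, size=n)] + rng.normal(0, 0.2, (n, 2)),
                       c1[rng.integers(10, size=n)] + rng.normal(0, 0.2, (n, 2))])
        return X, np.repeat([0, 1], n)
    return sample

for name, sample in [("single Gaussians", sampler_single()),
                     ("clustered mixture", sampler_mixture())]:
    Xtr, ytr = sample(100)
    Xte, yte = sample(5000)
    linear = LogisticRegression().fit(Xtr, ytr)
    knn = KNeighborsClassifier(n_neighbors=15).fit(Xtr, ytr)
    print(f"{name}: linear acc={linear.score(Xte, yte):.3f}  "
          f"15-NN acc={knn.score(Xte, yte):.3f}")
#+end_src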

** Bias, variance
:PROPERTIES:
:CUSTOM_ID: bias-variance
:END:
*** Bias
:PROPERTIES:
:CUSTOM_ID: bias
:END:
\[Bias_\theta(\hat{\theta}) = E_{x \mid \theta}(\hat{\theta}) - \theta\]
where \(E_{x \mid \theta}\) is the expectation over \(x \mid \theta\) (average over
all possible observations \(x\) with \(\theta\) fixed)

Error from erroneous assumptions in the learning algorithm. High bias
can cause an algorithm to miss the relevant relations between features
and target outputs (underfitting).
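
A standard worked example: the maximum-likelihood estimator of a
Gaussian variance, \(\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2\),
satisfies \(E(\hat{\sigma}^2) = \frac{n-1}{n}\sigma^2\), so
\[Bias_{\sigma^2}(\hat{\sigma}^2) = E(\hat{\sigma}^2) - \sigma^2 = -\frac{\sigma^2}{n},\]
a negative bias that vanishes as \(n \to \infty\).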

*** Variance
:PROPERTIES:
:CUSTOM_ID: variance
:END:
Measures the dispersion: \[Variance(X) = E\big[(X - E(X))^2\big]\]
Error from sensitivity to small fluctuations in the training set. High
variance may result from an algorithm modeling the random noise in the
training data (overfitting).
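
A quick numerical check of both quantities (a minimal numpy sketch with
assumed parameters), using the same ML variance estimator as above:

#+begin_src python
# Sketch (assumed parameters): Monte-Carlo estimate of the bias and variance
# of the ML variance estimator (1/n) * sum((x_i - mean(x))^2),
# for n = 10 draws from N(0, 1), i.e. true sigma^2 = 1.
import numpy as np

rng = np.random.default_rng(0)
n, sigma2 = 10, 1.0

# 100000 independent samples of size n, one estimate of sigma^2 per sample
samples = rng.normal(0.0, np.sqrt(sigma2), size=(100_000, n))
estimates = samples.var(axis=1)   # np.var uses ddof=0 by default: the ML estimator

print("bias:    ", estimates.mean() - sigma2)   # close to -sigma2/n = -0.1
print("variance:", estimates.var())             # spread of the estimator itself
#+end_src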

** Curse of dimensionality = sparse sampling in high dimensions -> it's harder to have enough training data
:PROPERTIES:
:CUSTOM_ID: curse-of-dimensionality-sparse-sampling-in-high-dimensions---its-harder-to-have-enough-training-data
:END:

#+begin_quote
we saw that squared error loss lead us to the regression function f(x) =
E(Y | X = x) for a quantitative response. The class of nearest-neighbor
methods can be viewed as direct estimates of this conditional
expectation, but we have seen that they can fail in at least two ways:
• if the dimension of the input space is high, the nearest neighbors
  need not be close to the target point, and can result in large errors;
• if special structure is known to exist, this can be used to reduce
  both the bias and the variance of the estimates.

#+end_quote

Other models are designed to overcome these dimensionality problems.
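
A quick numerical illustration of the sparse-sampling problem (a minimal
numpy sketch with assumed parameters): with a fixed number of points
uniform in the unit hypercube, the distance from a central query point
to its nearest neighbour grows quickly with the dimension, so the
"neighbourhood" stops being local.

#+begin_src python
# Sketch (assumed parameters): nearest-neighbour distance from the centre of
# the unit hypercube to N uniform points, as the dimension p grows.
import numpy as np

rng = np.random.default_rng(0)
N = 1000

for p in [1, 2, 5, 10, 20, 50]:
    X = rng.uniform(0, 1, size=(N, p))
    query = np.full(p, 0.5)                            # centre of the cube
    nn_dist = np.linalg.norm(X - query, axis=1).min()  # distance to the nearest point
    print(f"p={p:3d}  nearest-neighbour distance={nn_dist:.3f}  "
          f"cube diagonal={np.sqrt(p):.3f}")
#+end_src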