powers/Sundog-ML-Course: materials/KFoldCrossValidation.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# K-Fold Cross Validation"
   ]
  },
  {
   "cell_type": "raw",
   "metadata": {},
   "source": [
    "Let's revisit the Iris data set:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "from sklearn.model_selection import cross_val_score, train_test_split\n",
    "from sklearn import datasets\n",
    "from sklearn import svm\n",
    "\n",
    "iris = datasets.load_iris()\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A single train/test split is made easy with the train_test_split function in the cross_validation library:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.9666666666666667"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Split the iris data into train/test data sets with 40% reserved for testing\n",
    "X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.4, random_state=0)\n",
    "\n",
    "# Build an SVC model for predicting iris classifications using training data\n",
    "clf = svm.SVC(kernel='linear', C=1).fit(X_train, y_train)\n",
    "\n",
    "# Now measure its performance with the test data\n",
    "clf.score(X_test, y_test)   "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "K-Fold cross validation is just as easy; let's use a K of 5:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[0.96666667 1.         0.96666667 0.96666667 1.        ]\n",
      "0.9800000000000001\n"
     ]
    }
   ],
   "source": [
    "# We give cross_val_score a model, the entire data set and its \"real\" values, and the number of folds:\n",
    "scores = cross_val_score(clf, iris.data, iris.target, cv=5)\n",
    "\n",
    "# Print the accuracy for each fold:\n",
    "print(scores)\n",
    "\n",
    "# And the mean accuracy of all 5 folds:\n",
    "print(scores.mean())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Our model is even better than we thought! Can we do better? Let's try a different kernel (poly):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[1.         1.         0.9        0.93333333 1.        ]\n",
      "0.9666666666666666\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "E:\\Anaconda3\\lib\\site-packages\\sklearn\\svm\\base.py:196: FutureWarning: The default value of gamma will change from 'auto' to 'scale' in version 0.22 to account better for unscaled features. Set gamma explicitly to 'auto' or 'scale' to avoid this warning.\n",
      "  \"avoid this warning.\", FutureWarning)\n",
      "E:\\Anaconda3\\lib\\site-packages\\sklearn\\svm\\base.py:196: FutureWarning: The default value of gamma will change from 'auto' to 'scale' in version 0.22 to account better for unscaled features. Set gamma explicitly to 'auto' or 'scale' to avoid this warning.\n",
      "  \"avoid this warning.\", FutureWarning)\n",
      "E:\\Anaconda3\\lib\\site-packages\\sklearn\\svm\\base.py:196: FutureWarning: The default value of gamma will change from 'auto' to 'scale' in version 0.22 to account better for unscaled features. Set gamma explicitly to 'auto' or 'scale' to avoid this warning.\n",
      "  \"avoid this warning.\", FutureWarning)\n",
      "E:\\Anaconda3\\lib\\site-packages\\sklearn\\svm\\base.py:196: FutureWarning: The default value of gamma will change from 'auto' to 'scale' in version 0.22 to account better for unscaled features. Set gamma explicitly to 'auto' or 'scale' to avoid this warning.\n",
      "  \"avoid this warning.\", FutureWarning)\n",
      "E:\\Anaconda3\\lib\\site-packages\\sklearn\\svm\\base.py:196: FutureWarning: The default value of gamma will change from 'auto' to 'scale' in version 0.22 to account better for unscaled features. Set gamma explicitly to 'auto' or 'scale' to avoid this warning.\n",
      "  \"avoid this warning.\", FutureWarning)\n"
     ]
    }
   ],
   "source": [
    "clf = svm.SVC(kernel='poly', C=1)\n",
    "scores = cross_val_score(clf, iris.data, iris.target, cv=5)\n",
    "print(scores)\n",
    "print(scores.mean())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "No! The more complex polynomial kernel produced lower accuracy than a simple linear kernel. The polynomial kernel is overfitting. But we couldn't have told that with a single train/test split:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "E:\\Anaconda3\\lib\\site-packages\\sklearn\\svm\\base.py:196: FutureWarning: The default value of gamma will change from 'auto' to 'scale' in version 0.22 to account better for unscaled features. Set gamma explicitly to 'auto' or 'scale' to avoid this warning.\n",
      "  \"avoid this warning.\", FutureWarning)\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "0.9666666666666667"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Build an SVC model for predicting iris classifications using training data\n",
    "clf = svm.SVC(kernel='poly', C=1).fit(X_train, y_train)\n",
    "\n",
    "# Now measure its performance with the test data\n",
    "clf.score(X_test, y_test)   "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "That's the same score we got with a single train/test split on the linear kernel."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Activity"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The \"poly\" kernel for SVC actually has another attribute for the number of degrees of the polynomial used, which defaults to 3. For example, svm.SVC(kernel='poly', degree=3, C=1)\n",
    "\n",
    "We think the default third-degree polynomial is overfitting, based on the results above. But how about 2? Give that a try and compare it to the linear kernel."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}