powers/Sundog-ML-Course: materials/XGBoost.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Using XGBoost is easy. Maybe too easy, considering it's generally considered the best ML algorithm around right now.\n",
    "\n",
    "To install it, just:\n",
    "\n",
    "pip install xgboost\n",
    "\n",
    "Let's experiment using the Iris data set. This data set includes the width and length of the petals and sepals of many Iris flowers, and the specific species of Iris the flower belongs to. Our challenge is to predict the species of a flower sample just based on the sizes of its petals. We'll revisit this data set later when we talk about principal component analysis too."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "150\n",
      "4\n",
      "['setosa', 'versicolor', 'virginica']\n"
     ]
    }
   ],
   "source": [
    "from sklearn.datasets import load_iris\n",
    "\n",
    "iris = load_iris()\n",
    "\n",
    "numSamples, numFeatures = iris.data.shape\n",
    "print(numSamples)\n",
    "print(numFeatures)\n",
    "print(list(iris.target_names))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's divide our data into 20% reserved for testing our model, and the remaining 80% to train it with. By withholding our test data, we can make sure we're evaluating its results based on new flowers it hasn't seen before. Typically we refer to our features (in this case, the petal sizes) as X, and the labels (in this case, the species) as y."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.model_selection import train_test_split\n",
    "\n",
    "X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.2, random_state=0)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we'll load up XGBoost, and convert our data into the DMatrix format it expects. One for the training data, and one for the test data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "import xgboost as xgb\n",
    "\n",
    "train = xgb.DMatrix(X_train, label=y_train)\n",
    "test = xgb.DMatrix(X_test, label=y_test)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we'll define our hyperparameters. We're choosing softmax since this is a multiple classification problem, but the other parameters should ideally be tuned through experimentation."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "param = {\n",
    "    'max_depth': 4,\n",
    "    'eta': 0.3,\n",
    "    'objective': 'multi:softmax',\n",
    "    'num_class': 3} \n",
    "epochs = 10 "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's go ahead and train our model using these parameters as a first guess."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "model = xgb.train(param, train, epochs)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we'll use the trained model to predict classifications for the data we set aside for testing. Each classification number we get back corresponds to a specific species of Iris."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "predictions = model.predict(test)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[2. 1. 0. 2. 0. 2. 0. 1. 1. 1. 2. 1. 1. 1. 1. 0. 1. 1. 0. 0. 2. 1. 0. 0.\n",
      " 2. 0. 0. 1. 1. 0.]\n"
     ]
    }
   ],
   "source": [
    "print(predictions)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's measure the accuracy on the test data..."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "1.0"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from sklearn.metrics import accuracy_score\n",
    "\n",
    "accuracy_score(y_test, predictions)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Holy crow! It's perfect, and that's just with us guessing as to the best hyperparameters!\n",
    "\n",
    "Normally I'd have you experiment to find better hyperparameters as an activity, but you can't improve on those results. Instead, see what it takes to make the results worse! How few epochs (iterations) can I get away with? How low can I set the max_depth? Basically try to optimize the simplicity and performance of the model, now that you already have perfect accuracy."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}