powers/Sundog-ML-Course: materials/DeepLearningProject-Solution.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Final Project\n",
    "\n",
    "## Predict whether a mammogram mass is benign or malignant\n",
    "\n",
    "We'll be using the \"mammographic masses\" public dataset from the UCI repository (source: https://archive.ics.uci.edu/ml/datasets/Mammographic+Mass)\n",
    "\n",
    "This data contains 961 instances of masses detected in mammograms, and contains the following attributes:\n",
    "\n",
    "\n",
    "   1. BI-RADS assessment: 1 to 5 (ordinal)  \n",
    "   2. Age: patient's age in years (integer)\n",
    "   3. Shape: mass shape: round=1 oval=2 lobular=3 irregular=4 (nominal)\n",
    "   4. Margin: mass margin: circumscribed=1 microlobulated=2 obscured=3 ill-defined=4 spiculated=5 (nominal)\n",
    "   5. Density: mass density high=1 iso=2 low=3 fat-containing=4 (ordinal)\n",
    "   6. Severity: benign=0 or malignant=1 (binominal)\n",
    "   \n",
    "BI-RADS is an assesment of how confident the severity classification is; it is not a \"predictive\" attribute and so we will discard it. The age, shape, margin, and density attributes are the features that we will build our model with, and \"severity\" is the classification we will attempt to predict based on those attributes.\n",
    "\n",
    "Although \"shape\" and \"margin\" are nominal data types, which sklearn typically doesn't deal with well, they are close enough to ordinal that we shouldn't just discard them. The \"shape\" for example is ordered increasingly from round to irregular.\n",
    "\n",
    "A lot of unnecessary anguish and surgery arises from false positives arising from mammogram results. If we can build a better way to interpret them through supervised machine learning, it could improve a lot of lives.\n",
    "\n",
    "## Your assignment\n",
    "\n",
    "Build a Multi-Layer Perceptron and train it to classify masses as benign or malignant based on its features.\n",
    "\n",
    "The data needs to be cleaned; many rows contain missing data, and there may be erroneous data identifiable as outliers as well.\n",
    "\n",
    "Remember to normalize your data first! And experiment with different topologies, optimizers, and hyperparameters.\n",
    "\n",
    "I was able to achieve over 80% accuracy - can you beat that?\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Let's begin: prepare your data\n",
    "\n",
    "Start by importing the mammographic_masses.data.txt file into a Pandas dataframe (hint: use read_csv) and take a look at it."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>5</th>\n",
       "      <th>67</th>\n",
       "      <th>3</th>\n",
       "      <th>5.1</th>\n",
       "      <th>3.1</th>\n",
       "      <th>1</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>4</td>\n",
       "      <td>43</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>?</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>5</td>\n",
       "      <td>58</td>\n",
       "      <td>4</td>\n",
       "      <td>5</td>\n",
       "      <td>3</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>4</td>\n",
       "      <td>28</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>3</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>5</td>\n",
       "      <td>74</td>\n",
       "      <td>1</td>\n",
       "      <td>5</td>\n",
       "      <td>?</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>4</td>\n",
       "      <td>65</td>\n",
       "      <td>1</td>\n",
       "      <td>?</td>\n",
       "      <td>3</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   5  67  3 5.1 3.1  1\n",
       "0  4  43  1   1   ?  1\n",
       "1  5  58  4   5   3  1\n",
       "2  4  28  1   1   3  0\n",
       "3  5  74  1   5   ?  1\n",
       "4  4  65  1   ?   3  0"
      ]
     },
     "execution_count": 1,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import pandas as pd\n",
    "\n",
    "masses_data = pd.read_csv('mammographic_masses.data.txt')\n",
    "masses_data.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Make sure you use the optional parmaters in read_csv to convert missing data (indicated by a ?) into NaN, and to add the appropriate column names (BI_RADS, age, shape, margin, density, and severity):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>BI-RADS</th>\n",
       "      <th>age</th>\n",
       "      <th>shape</th>\n",
       "      <th>margin</th>\n",
       "      <th>density</th>\n",
       "      <th>severity</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>5.0</td>\n",
       "      <td>67.0</td>\n",
       "      <td>3.0</td>\n",
       "      <td>5.0</td>\n",
       "      <td>3.0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>4.0</td>\n",
       "      <td>43.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>5.0</td>\n",
       "      <td>58.0</td>\n",
       "      <td>4.0</td>\n",
       "      <td>5.0</td>\n",
       "      <td>3.0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>4.0</td>\n",
       "      <td>28.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>3.0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>5.0</td>\n",
       "      <td>74.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>5.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   BI-RADS   age  shape  margin  density  severity\n",
       "0      5.0  67.0    3.0     5.0      3.0         1\n",
       "1      4.0  43.0    1.0     1.0      NaN         1\n",
       "2      5.0  58.0    4.0     5.0      3.0         1\n",
       "3      4.0  28.0    1.0     1.0      3.0         0\n",
       "4      5.0  74.0    1.0     5.0      NaN         1"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "masses_data = pd.read_csv('mammographic_masses.data.txt', na_values=['?'], names = ['BI-RADS', 'age', 'shape', 'margin', 'density', 'severity'])\n",
    "masses_data.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Evaluate whether the data needs cleaning; your model is only as good as the data it's given. Hint: use describe() on the dataframe."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>BI-RADS</th>\n",
       "      <th>age</th>\n",
       "      <th>shape</th>\n",
       "      <th>margin</th>\n",
       "      <th>density</th>\n",
       "      <th>severity</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>count</th>\n",
       "      <td>959.000000</td>\n",
       "      <td>956.000000</td>\n",
       "      <td>930.000000</td>\n",
       "      <td>913.000000</td>\n",
       "      <td>885.000000</td>\n",
       "      <td>961.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>mean</th>\n",
       "      <td>4.348279</td>\n",
       "      <td>55.487448</td>\n",
       "      <td>2.721505</td>\n",
       "      <td>2.796276</td>\n",
       "      <td>2.910734</td>\n",
       "      <td>0.463059</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>std</th>\n",
       "      <td>1.783031</td>\n",
       "      <td>14.480131</td>\n",
       "      <td>1.242792</td>\n",
       "      <td>1.566546</td>\n",
       "      <td>0.380444</td>\n",
       "      <td>0.498893</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>min</th>\n",
       "      <td>0.000000</td>\n",
       "      <td>18.000000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>0.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>25%</th>\n",
       "      <td>4.000000</td>\n",
       "      <td>45.000000</td>\n",
       "      <td>2.000000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>3.000000</td>\n",
       "      <td>0.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>50%</th>\n",
       "      <td>4.000000</td>\n",
       "      <td>57.000000</td>\n",
       "      <td>3.000000</td>\n",
       "      <td>3.000000</td>\n",
       "      <td>3.000000</td>\n",
       "      <td>0.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>75%</th>\n",
       "      <td>5.000000</td>\n",
       "      <td>66.000000</td>\n",
       "      <td>4.000000</td>\n",
       "      <td>4.000000</td>\n",
       "      <td>3.000000</td>\n",
       "      <td>1.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>max</th>\n",
       "      <td>55.000000</td>\n",
       "      <td>96.000000</td>\n",
       "      <td>4.000000</td>\n",
       "      <td>5.000000</td>\n",
       "      <td>4.000000</td>\n",
       "      <td>1.000000</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "          BI-RADS         age       shape      margin     density    severity\n",
       "count  959.000000  956.000000  930.000000  913.000000  885.000000  961.000000\n",
       "mean     4.348279   55.487448    2.721505    2.796276    2.910734    0.463059\n",
       "std      1.783031   14.480131    1.242792    1.566546    0.380444    0.498893\n",
       "min      0.000000   18.000000    1.000000    1.000000    1.000000    0.000000\n",
       "25%      4.000000   45.000000    2.000000    1.000000    3.000000    0.000000\n",
       "50%      4.000000   57.000000    3.000000    3.000000    3.000000    0.000000\n",
       "75%      5.000000   66.000000    4.000000    4.000000    3.000000    1.000000\n",
       "max     55.000000   96.000000    4.000000    5.000000    4.000000    1.000000"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "masses_data.describe()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "There are quite a few missing values in the data set. Before we just drop every row that's missing data, let's make sure we don't bias our data in doing so. Does there appear to be any sort of correlation to what sort of data has missing fields? If there were, we'd have to try and go back and fill that data in."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>BI-RADS</th>\n",
       "      <th>age</th>\n",
       "      <th>shape</th>\n",
       "      <th>margin</th>\n",
       "      <th>density</th>\n",
       "      <th>severity</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>4.0</td>\n",
       "      <td>43.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>5.0</td>\n",
       "      <td>74.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>5.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>4.0</td>\n",
       "      <td>65.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>3.0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>4.0</td>\n",
       "      <td>70.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>3.0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>5.0</td>\n",
       "      <td>42.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>3.0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>778</th>\n",
       "      <td>4.0</td>\n",
       "      <td>60.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>4.0</td>\n",
       "      <td>3.0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>819</th>\n",
       "      <td>4.0</td>\n",
       "      <td>35.0</td>\n",
       "      <td>3.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>2.0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>824</th>\n",
       "      <td>6.0</td>\n",
       "      <td>40.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>3.0</td>\n",
       "      <td>4.0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>884</th>\n",
       "      <td>5.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>4.0</td>\n",
       "      <td>4.0</td>\n",
       "      <td>3.0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>923</th>\n",
       "      <td>5.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>4.0</td>\n",
       "      <td>3.0</td>\n",
       "      <td>3.0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>130 rows × 6 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "     BI-RADS   age  shape  margin  density  severity\n",
       "1        4.0  43.0    1.0     1.0      NaN         1\n",
       "4        5.0  74.0    1.0     5.0      NaN         1\n",
       "5        4.0  65.0    1.0     NaN      3.0         0\n",
       "6        4.0  70.0    NaN     NaN      3.0         0\n",
       "7        5.0  42.0    1.0     NaN      3.0         0\n",
       "..       ...   ...    ...     ...      ...       ...\n",
       "778      4.0  60.0    NaN     4.0      3.0         0\n",
       "819      4.0  35.0    3.0     NaN      2.0         0\n",
       "824      6.0  40.0    NaN     3.0      4.0         1\n",
       "884      5.0   NaN    4.0     4.0      3.0         1\n",
       "923      5.0   NaN    4.0     3.0      3.0         1\n",
       "\n",
       "[130 rows x 6 columns]"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "masses_data.loc[(masses_data['age'].isnull()) |\n",
    "              (masses_data['shape'].isnull()) |\n",
    "              (masses_data['margin'].isnull()) |\n",
    "              (masses_data['density'].isnull())]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If the missing data seems randomly distributed, go ahead and drop rows with missing data. Hint: use dropna()."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>BI-RADS</th>\n",
       "      <th>age</th>\n",
       "      <th>shape</th>\n",
       "      <th>margin</th>\n",
       "      <th>density</th>\n",
       "      <th>severity</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>count</th>\n",
       "      <td>830.000000</td>\n",
       "      <td>830.000000</td>\n",
       "      <td>830.000000</td>\n",
       "      <td>830.000000</td>\n",
       "      <td>830.000000</td>\n",
       "      <td>830.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>mean</th>\n",
       "      <td>4.393976</td>\n",
       "      <td>55.781928</td>\n",
       "      <td>2.781928</td>\n",
       "      <td>2.813253</td>\n",
       "      <td>2.915663</td>\n",
       "      <td>0.485542</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>std</th>\n",
       "      <td>1.888371</td>\n",
       "      <td>14.671782</td>\n",
       "      <td>1.242361</td>\n",
       "      <td>1.567175</td>\n",
       "      <td>0.350936</td>\n",
       "      <td>0.500092</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>min</th>\n",
       "      <td>0.000000</td>\n",
       "      <td>18.000000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>0.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>25%</th>\n",
       "      <td>4.000000</td>\n",
       "      <td>46.000000</td>\n",
       "      <td>2.000000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>3.000000</td>\n",
       "      <td>0.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>50%</th>\n",
       "      <td>4.000000</td>\n",
       "      <td>57.000000</td>\n",
       "      <td>3.000000</td>\n",
       "      <td>3.000000</td>\n",
       "      <td>3.000000</td>\n",
       "      <td>0.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>75%</th>\n",
       "      <td>5.000000</td>\n",
       "      <td>66.000000</td>\n",
       "      <td>4.000000</td>\n",
       "      <td>4.000000</td>\n",
       "      <td>3.000000</td>\n",
       "      <td>1.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>max</th>\n",
       "      <td>55.000000</td>\n",
       "      <td>96.000000</td>\n",
       "      <td>4.000000</td>\n",
       "      <td>5.000000</td>\n",
       "      <td>4.000000</td>\n",
       "      <td>1.000000</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "          BI-RADS         age       shape      margin     density    severity\n",
       "count  830.000000  830.000000  830.000000  830.000000  830.000000  830.000000\n",
       "mean     4.393976   55.781928    2.781928    2.813253    2.915663    0.485542\n",
       "std      1.888371   14.671782    1.242361    1.567175    0.350936    0.500092\n",
       "min      0.000000   18.000000    1.000000    1.000000    1.000000    0.000000\n",
       "25%      4.000000   46.000000    2.000000    1.000000    3.000000    0.000000\n",
       "50%      4.000000   57.000000    3.000000    3.000000    3.000000    0.000000\n",
       "75%      5.000000   66.000000    4.000000    4.000000    3.000000    1.000000\n",
       "max     55.000000   96.000000    4.000000    5.000000    4.000000    1.000000"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "masses_data.dropna(inplace=True)\n",
    "masses_data.describe()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "Next you'll need to convert the Pandas dataframes into numpy arrays that can be used by scikit_learn. Create an array that extracts only the feature data we want to work with (age, shape, margin, and density) and another array that contains the classes (severity). You'll also need an array of the feature name labels."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[67.,  3.,  5.,  3.],\n",
       "       [58.,  4.,  5.,  3.],\n",
       "       [28.,  1.,  1.,  3.],\n",
       "       ...,\n",
       "       [64.,  4.,  5.,  3.],\n",
       "       [66.,  4.,  5.,  3.],\n",
       "       [62.,  3.,  3.,  3.]])"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "all_features = masses_data[['age', 'shape',\n",
    "                             'margin', 'density']].values\n",
    "\n",
    "\n",
    "all_classes = masses_data['severity'].values\n",
    "\n",
    "feature_names = ['age', 'shape', 'margin', 'density']\n",
    "\n",
    "all_features"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Some of our models require the input data to be normalized, so go ahead and normalize the attribute data. Hint: use preprocessing.StandardScaler()."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[ 0.7650629 ,  0.17563638,  1.39618483,  0.24046607],\n",
       "       [ 0.15127063,  0.98104077,  1.39618483,  0.24046607],\n",
       "       [-1.89470363, -1.43517241, -1.157718  ,  0.24046607],\n",
       "       ...,\n",
       "       [ 0.56046548,  0.98104077,  1.39618483,  0.24046607],\n",
       "       [ 0.69686376,  0.98104077,  1.39618483,  0.24046607],\n",
       "       [ 0.42406719,  0.17563638,  0.11923341,  0.24046607]])"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from sklearn import preprocessing\n",
    "\n",
    "scaler = preprocessing.StandardScaler()\n",
    "all_features_scaled = scaler.fit_transform(all_features)\n",
    "all_features_scaled"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now set up an actual MLP model using Keras:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "WARNING:tensorflow:From C:\\Users\\fkane\\anaconda3\\lib\\site-packages\\keras\\src\\losses.py:2976: The name tf.losses.sparse_softmax_cross_entropy is deprecated. Please use tf.compat.v1.losses.sparse_softmax_cross_entropy instead.\n",
      "\n"
     ]
    }
   ],
   "source": [
    "from tensorflow.keras.layers import Dense\n",
    "from tensorflow.keras.models import Sequential\n",
    "\n",
    "def create_model():\n",
    "    model = Sequential()\n",
    "    #4 feature inputs going into an 6-unit layer (more does not seem to help - in fact you can go down to 4)\n",
    "    model.add(Dense(6, input_dim=4, kernel_initializer='normal', activation='relu'))\n",
    "    # \"Deep learning\" turns out to be unnecessary - this additional hidden layer doesn't help either.\n",
    "    #model.add(Dense(4, kernel_initializer='normal', activation='relu'))\n",
    "    # Output layer with a binary classification (benign or malignant)\n",
    "    model.add(Dense(1, kernel_initializer='normal', activation='sigmoid'))\n",
    "    # Compile model; rmsprop seemed to work best\n",
    "    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])\n",
    "    return model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "WARNING:tensorflow:From C:\\Users\\fkane\\anaconda3\\lib\\site-packages\\keras\\src\\backend.py:873: The name tf.get_default_graph is deprecated. Please use tf.compat.v1.get_default_graph instead.\n",
      "\n",
      "WARNING:tensorflow:From C:\\Users\\fkane\\anaconda3\\lib\\site-packages\\keras\\src\\optimizers\\__init__.py:309: The name tf.train.Optimizer is deprecated. Please use tf.compat.v1.train.Optimizer instead.\n",
      "\n",
      "WARNING:tensorflow:From C:\\Users\\fkane\\anaconda3\\lib\\site-packages\\keras\\src\\utils\\tf_utils.py:492: The name tf.ragged.RaggedTensorValue is deprecated. Please use tf.compat.v1.ragged.RaggedTensorValue instead.\n",
      "\n",
      "WARNING:tensorflow:From C:\\Users\\fkane\\anaconda3\\lib\\site-packages\\keras\\src\\engine\\base_layer_utils.py:384: The name tf.executing_eagerly_outside_functions is deprecated. Please use tf.compat.v1.executing_eagerly_outside_functions instead.\n",
      "\n",
      "WARNING:tensorflow:5 out of the last 13 calls to <function Model.make_predict_function.<locals>.predict_function at 0x000002D0DF0F45E0> triggered tf.function retracing. Tracing is expensive and the excessive number of tracings could be due to (1) creating @tf.function repeatedly in a loop, (2) passing tensors with different shapes, (3) passing Python objects instead of tensors. For (1), please define your @tf.function outside of the loop. For (2), @tf.function has reduce_retracing=True option that can avoid unnecessary retracing. For (3), please refer to https://www.tensorflow.org/guide/function#controlling_retracing and https://www.tensorflow.org/api_docs/python/tf/function for  more details.\n",
      "WARNING:tensorflow:5 out of the last 13 calls to <function Model.make_predict_function.<locals>.predict_function at 0x000002D0DF2360D0> triggered tf.function retracing. Tracing is expensive and the excessive number of tracings could be due to (1) creating @tf.function repeatedly in a loop, (2) passing tensors with different shapes, (3) passing Python objects instead of tensors. For (1), please define your @tf.function outside of the loop. For (2), @tf.function has reduce_retracing=True option that can avoid unnecessary retracing. For (3), please refer to https://www.tensorflow.org/guide/function#controlling_retracing and https://www.tensorflow.org/api_docs/python/tf/function for  more details.\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "0.8060240963855421"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from sklearn.model_selection import cross_val_score\n",
    "from scikeras.wrappers import KerasClassifier\n",
    "\n",
    "# Wrap our Keras model in an estimator compatible with scikit_learn\n",
    "estimator = KerasClassifier(model=create_model, epochs=100, verbose=0)\n",
    "# Now we can use scikit_learn's cross_val_score to evaluate this model identically to the others\n",
    "cv_scores = cross_val_score(estimator, all_features_scaled, all_classes, cv=10)\n",
    "cv_scores.mean()\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "## How did you do?\n",
    "\n",
    "Which topology, and which choice of hyperparameters, performed the best? Feel free to share your results!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.13"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}