{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Recurring Neural Networks with Keras\n",
"\n",
"## Sentiment analysis from movie reviews\n",
"\n",
"This notebook is inspired by the imdb_lstm.py example that ships with Keras. But since I used to run IMDb's engineering department, I couldn't resist!\n",
"\n",
"It's actually a great example of using RNN's. The data set we're using consists of user-generated movie reviews and classification of whether the user liked the movie or not based on its associated rating.\n",
"\n",
"More info on the dataset is here:\n",
"\n",
"https://keras.io/datasets/#imdb-movie-reviews-sentiment-classification\n",
"\n",
"So we are going to use an RNN to do sentiment analysis on full-text movie reviews!\n",
"\n",
"Think about how amazing this is. We're going to train an artificial neural network how to \"read\" movie reviews and guess whether the author liked the movie or not from them.\n",
"\n",
"Since understanding written language requires keeping track of all the words in a sentence, we need a recurrent neural network to keep a \"memory\" of the words that have come before as it \"reads\" sentences over time.\n",
"\n",
"In particular, we'll use LSTM (Long Short-Term Memory) cells because we don't really want to \"forget\" words too quickly - words early on in a sentence can affect the meaning of that sentence significantly.\n",
"\n",
"Let's start by importing the stuff we need:"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"from tensorflow.keras.preprocessing import sequence\n",
"from tensorflow.keras.models import Sequential\n",
"from tensorflow.keras.layers import Dense, Embedding\n",
"from tensorflow.keras.layers import LSTM\n",
"from tensorflow.keras.datasets import imdb"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now import our training and testing data. We specify that we only care about the 20,000 most popular words in the dataset in order to keep things somewhat managable. The dataset includes 5,000 training reviews and 25,000 testing reviews for some reason."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Loading data...\n"
]
}
],
"source": [
"print('Loading data...')\n",
"(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=20000)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's get a feel for what this data looks like. Let's look at the first training feature, which should represent a written movie review:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[1,\n",
" 14,\n",
" 22,\n",
" 16,\n",
" 43,\n",
" 530,\n",
" 973,\n",
" 1622,\n",
" 1385,\n",
" 65,\n",
" 458,\n",
" 4468,\n",
" 66,\n",
" 3941,\n",
" 4,\n",
" 173,\n",
" 36,\n",
" 256,\n",
" 5,\n",
" 25,\n",
" 100,\n",
" 43,\n",
" 838,\n",
" 112,\n",
" 50,\n",
" 670,\n",
" 2,\n",
" 9,\n",
" 35,\n",
" 480,\n",
" 284,\n",
" 5,\n",
" 150,\n",
" 4,\n",
" 172,\n",
" 112,\n",
" 167,\n",
" 2,\n",
" 336,\n",
" 385,\n",
" 39,\n",
" 4,\n",
" 172,\n",
" 4536,\n",
" 1111,\n",
" 17,\n",
" 546,\n",
" 38,\n",
" 13,\n",
" 447,\n",
" 4,\n",
" 192,\n",
" 50,\n",
" 16,\n",
" 6,\n",
" 147,\n",
" 2025,\n",
" 19,\n",
" 14,\n",
" 22,\n",
" 4,\n",
" 1920,\n",
" 4613,\n",
" 469,\n",
" 4,\n",
" 22,\n",
" 71,\n",
" 87,\n",
" 12,\n",
" 16,\n",
" 43,\n",
" 530,\n",
" 38,\n",
" 76,\n",
" 15,\n",
" 13,\n",
" 1247,\n",
" 4,\n",
" 22,\n",
" 17,\n",
" 515,\n",
" 17,\n",
" 12,\n",
" 16,\n",
" 626,\n",
" 18,\n",
" 19193,\n",
" 5,\n",
" 62,\n",
" 386,\n",
" 12,\n",
" 8,\n",
" 316,\n",
" 8,\n",
" 106,\n",
" 5,\n",
" 4,\n",
" 2223,\n",
" 5244,\n",
" 16,\n",
" 480,\n",
" 66,\n",
" 3785,\n",
" 33,\n",
" 4,\n",
" 130,\n",
" 12,\n",
" 16,\n",
" 38,\n",
" 619,\n",
" 5,\n",
" 25,\n",
" 124,\n",
" 51,\n",
" 36,\n",
" 135,\n",
" 48,\n",
" 25,\n",
" 1415,\n",
" 33,\n",
" 6,\n",
" 22,\n",
" 12,\n",
" 215,\n",
" 28,\n",
" 77,\n",
" 52,\n",
" 5,\n",
" 14,\n",
" 407,\n",
" 16,\n",
" 82,\n",
" 10311,\n",
" 8,\n",
" 4,\n",
" 107,\n",
" 117,\n",
" 5952,\n",
" 15,\n",
" 256,\n",
" 4,\n",
" 2,\n",
" 7,\n",
" 3766,\n",
" 5,\n",
" 723,\n",
" 36,\n",
" 71,\n",
" 43,\n",
" 530,\n",
" 476,\n",
" 26,\n",
" 400,\n",
" 317,\n",
" 46,\n",
" 7,\n",
" 4,\n",
" 12118,\n",
" 1029,\n",
" 13,\n",
" 104,\n",
" 88,\n",
" 4,\n",
" 381,\n",
" 15,\n",
" 297,\n",
" 98,\n",
" 32,\n",
" 2071,\n",
" 56,\n",
" 26,\n",
" 141,\n",
" 6,\n",
" 194,\n",
" 7486,\n",
" 18,\n",
" 4,\n",
" 226,\n",
" 22,\n",
" 21,\n",
" 134,\n",
" 476,\n",
" 26,\n",
" 480,\n",
" 5,\n",
" 144,\n",
" 30,\n",
" 5535,\n",
" 18,\n",
" 51,\n",
" 36,\n",
" 28,\n",
" 224,\n",
" 92,\n",
" 25,\n",
" 104,\n",
" 4,\n",
" 226,\n",
" 65,\n",
" 16,\n",
" 38,\n",
" 1334,\n",
" 88,\n",
" 12,\n",
" 16,\n",
" 283,\n",
" 5,\n",
" 16,\n",
" 4472,\n",
" 113,\n",
" 103,\n",
" 32,\n",
" 15,\n",
" 16,\n",
" 5345,\n",
" 19,\n",
" 178,\n",
" 32]"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"x_train[0]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"That doesn't look like a movie review! But this data set has spared you a lot of trouble - they have already converted words to integer-based indices. The actual letters that make up a word don't really matter as far as our model is concerned, what matters are the words themselves - and our model needs numbers to work with, not letters.\n",
"\n",
"So just keep in mind that each number in the training features represent some specific word. It's a bummer that we can't just read the reviews in English as a gut check to see if sentiment analysis is really working, though.\n",
"\n",
"What do the labels look like?"
]
},
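{
"cell_type": "markdown",
"metadata": {},
"source": [
"The next cell is a minimal sketch of that decoding, assuming the default index_from=3 offset that imdb.load_data applies (indices 0, 1, and 2 are reserved for padding, start-of-sequence, and out-of-vocabulary markers):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Roughly decode the first review back into English.\n",
"# Assumes the default index_from=3 offset of imdb.load_data.\n",
"word_index = imdb.get_word_index()\n",
"reverse_index = {index + 3: word for word, index in word_index.items()}\n",
"reverse_index.update({0: '<pad>', 1: '<start>', 2: '<unk>'})\n",
"print(' '.join(reverse_index.get(i, '?') for i in x_train[0]))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"What do the labels look like?"
]
},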
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"1"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"y_train[0]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"They are just 0 or 1, which indicates whether the reviewer said they liked the movie or not.\n",
"\n",
"So to recap, we have a bunch of movie reviews that have been converted into vectors of words represented by integers, and a binary sentiment classification to learn from.\n",
"\n",
"RNN's can blow up quickly, so again to keep things managable on our little PC let's limit the reviews to their first 80 words:"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"x_train = sequence.pad_sequences(x_train, maxlen=80)\n",
"x_test = sequence.pad_sequences(x_test, maxlen=80)"
]
},
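{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick sanity check, every review should now be exactly 80 indices long - assuming the standard 25,000/25,000 split loaded above, both shapes should be (25000, 80):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Confirm that padding/truncation produced fixed-length sequences.\n",
"print(x_train.shape)\n",
"print(x_test.shape)"
]
},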
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now let's set up our neural network model! Considering how complicated a LSTM recurrent neural network is under the hood, it's really amazing how easy this is to do with Keras.\n",
"\n",
"We will start with an Embedding layer - this is just a step that converts the input data into dense vectors of fixed size that's better suited for a neural network. You generally see this in conjunction with index-based text data like we have here. The 20,000 indicates the vocabulary size (remember we said we only wanted the top 20,000 words) and 128 is the output dimension of 128 units.\n",
"\n",
"Next we just have to set up a LSTM layer for the RNN itself. It's that easy. We specify 128 to match the output size of the Embedding layer, and dropout terms to avoid overfitting, which RNN's are particularly prone to.\n",
"\n",
"Finally we just need to boil it down to a single neuron with a sigmoid activation function to choose our binay sentiment classification of 0 or 1."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"model = Sequential()\n",
"model.add(Embedding(20000, 128))\n",
"model.add(LSTM(128, dropout=0.2, recurrent_dropout=0.2))\n",
"model.add(Dense(1, activation='sigmoid'))"
]
},
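{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you'd like to see the layer shapes and parameter counts (most of the parameters live in the Embedding layer), model.summary() will print them. Since we never gave the Embedding layer an explicit input length, we build the model first with our sequence length of 80:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Build with (batch, sequence length) so summary() can report shapes.\n",
"model.build(input_shape=(None, 80))\n",
"model.summary()"
]
},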
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As this is a binary classification problem, we'll use the binary_crossentropy loss function. And the Adam optimizer is usually a good choice (feel free to try others.)"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"model.compile(loss='binary_crossentropy',\n",
" optimizer='adam',\n",
" metrics=['accuracy'])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we will actually train our model. RNN's, like CNN's, are very resource heavy. Keeping the batch size relatively small is the key to enabling this to run on your PC at all. In the real word of course, you'd be taking advantage of GPU's installed across many computers on a cluster to make this scale a lot better.\n",
"\n",
"## Warning\n",
"\n",
"This will take a very long time to run, even on a fast PC! Don't execute the next blocks unless you're prepared to tie up your computer for an hour or more."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now let's kick off the training. Even with a GPU, this will take a long time!"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Train on 25000 samples, validate on 25000 samples\n",
"Epoch 1/15\n",
"25000/25000 - 83s - loss: 0.4605 - accuracy: 0.7798 - val_loss: 0.3795 - val_accuracy: 0.8349\n",
"Epoch 2/15\n",
"25000/25000 - 80s - loss: 0.3014 - accuracy: 0.8780 - val_loss: 0.4075 - val_accuracy: 0.8237\n",
"Epoch 3/15\n",
"25000/25000 - 81s - loss: 0.2124 - accuracy: 0.9182 - val_loss: 0.4114 - val_accuracy: 0.8238\n",
"Epoch 4/15\n",
"25000/25000 - 81s - loss: 0.1520 - accuracy: 0.9445 - val_loss: 0.4581 - val_accuracy: 0.8271\n",
"Epoch 5/15\n",
"25000/25000 - 79s - loss: 0.1097 - accuracy: 0.9603 - val_loss: 0.5656 - val_accuracy: 0.8252\n",
"Epoch 6/15\n",
"25000/25000 - 78s - loss: 0.0772 - accuracy: 0.9724 - val_loss: 0.6683 - val_accuracy: 0.8210\n",
"Epoch 7/15\n",
"25000/25000 - 79s - loss: 0.0622 - accuracy: 0.9790 - val_loss: 0.7128 - val_accuracy: 0.8209\n",
"Epoch 8/15\n",
"25000/25000 - 81s - loss: 0.0366 - accuracy: 0.9877 - val_loss: 0.9249 - val_accuracy: 0.7985\n",
"Epoch 9/15\n",
"25000/25000 - 79s - loss: 0.0291 - accuracy: 0.9912 - val_loss: 0.9049 - val_accuracy: 0.8130\n",
"Epoch 10/15\n",
"25000/25000 - 85s - loss: 0.0255 - accuracy: 0.9914 - val_loss: 0.8627 - val_accuracy: 0.8105\n",
"Epoch 11/15\n",
"25000/25000 - 86s - loss: 0.0222 - accuracy: 0.9930 - val_loss: 0.9969 - val_accuracy: 0.8167\n",
"Epoch 12/15\n",
"25000/25000 - 82s - loss: 0.0178 - accuracy: 0.9940 - val_loss: 1.0226 - val_accuracy: 0.8143\n",
"Epoch 13/15\n",
"25000/25000 - 88s - loss: 0.0109 - accuracy: 0.9970 - val_loss: 1.1139 - val_accuracy: 0.8104\n",
"Epoch 14/15\n",
"25000/25000 - 89s - loss: 0.0094 - accuracy: 0.9971 - val_loss: 1.2101 - val_accuracy: 0.8125\n",
"Epoch 15/15\n",
"25000/25000 - 86s - loss: 0.0066 - accuracy: 0.9980 - val_loss: 1.1447 - val_accuracy: 0.8053\n"
]
},
{
"data": {
"text/plain": [
"<tensorflow.python.keras.callbacks.History at 0x1abc5012128>"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"model.fit(x_train, y_train,\n",
" batch_size=32,\n",
" epochs=15,\n",
" verbose=2,\n",
" validation_data=(x_test, y_test))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"OK, let's evaluate our model's accuracy:"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"25000/1 - 14s - loss: 1.3717 - accuracy: 0.8053\n",
"Test score: 1.144708181578219\n",
"Test accuracy: 0.80528\n"
]
}
],
"source": [
"score, acc = model.evaluate(x_test, y_test,\n",
" batch_size=32,\n",
" verbose=2)\n",
"print('Test score:', score)\n",
"print('Test accuracy:', acc)"
]
},
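{
"cell_type": "markdown",
"metadata": {},
"source": [
"Just to see the model in action on a single example, we can ask it for the probability that the first test review is positive (an output near 1 means positive) and compare that against the true label:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Predicted probability of a positive review vs. the true label.\n",
"print(model.predict(x_test[:1])[0][0])\n",
"print(y_test[0])"
]
},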
{
"cell_type": "markdown",
"metadata": {},
"source": [
"80% eh? Not too bad, considering we limited ourselves to just the first 80 words of each review.\n",
"\n",
"Note that the validation accuracy while we were training never really improved after the first epoch; we're likely just overfitting. This is a case where early stopping would have been beneficial.\n",
"\n",
"But again - stop and think about what we just made here! A neural network that can \"read\" reviews and deduce whether the author liked the movie or not based on that text. And it takes the context of each word and its position in the review into account - and setting up the model itself was just a few lines of code! It's pretty incredible what you can do with Keras."
]
},
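{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a minimal sketch of that early stopping idea: Keras provides an EarlyStopping callback that halts training once a monitored metric stops improving. The patience value here is just an illustrative choice, and the fit call is left commented out because running it would resume training the already-overfit model above rather than starting fresh."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from tensorflow.keras.callbacks import EarlyStopping\n",
"\n",
"# Stop when validation loss hasn't improved for 2 epochs in a row,\n",
"# and roll back to the best weights seen so far.\n",
"early_stop = EarlyStopping(monitor='val_loss', patience=2,\n",
"                           restore_best_weights=True)\n",
"\n",
"# model.fit(x_train, y_train,\n",
"#           batch_size=32,\n",
"#           epochs=15,\n",
"#           verbose=2,\n",
"#           validation_data=(x_test, y_test),\n",
"#           callbacks=[early_stop])"
]
},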
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.3"
}
},
"nbformat": 4,
"nbformat_minor": 2
}