{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# T-Tests and P-Values"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's say we're running an A/B test. We'll fabricate some data that randomly assigns order amounts from customers in sets A and B, with B being a little bit higher:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Ttest_indResult(statistic=-14.633287515087083, pvalue=3.0596466523801155e-48)"
      ]
     },
     "execution_count": 1,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import numpy as np\n",
    "from scipy import stats\n",
    "\n",
    "A = np.random.normal(25.0, 5.0, 10000)\n",
    "B = np.random.normal(26.0, 5.0, 10000)\n",
    "\n",
    "stats.ttest_ind(A, B)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The t-statistic is a measure of the difference between the two sets expressed in units of standard error. Put differently, it's the size of the difference relative to the variance in the data. A high t value means there's probably a real difference between the two sets; you have \"significance\". The P-value is a measure of the probability of an observation lying at extreme t-values; so a low p-value also implies \"significance.\" If you're looking for a \"statistically significant\" result, you want to see a very low p-value and a high t-statistic (well, a high absolute value of the t-statistic more precisely). In the real world, statisticians seem to put more weight on the p-value result.\n",
    "\n",
    "Let's change things up so both A and B are just random, generated under the same parameters. So there's no \"real\" difference between the two:"
   ]
  },
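  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make the t-statistic concrete, here's a minimal sketch that recomputes it by hand for the A and B above, using the pooled (equal-variance) formula that scipy's ttest_ind uses by default. The variable names (pooled_var, std_err, t_manual) are just for illustration; the result should match the statistic reported above:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Recompute the two-sample t-statistic by hand, assuming equal variances\n",
    "# (scipy's default for ttest_ind). It should match the statistic above.\n",
    "n_A, n_B = len(A), len(B)\n",
    "var_A, var_B = A.var(ddof=1), B.var(ddof=1)\n",
    "\n",
    "# Pooled variance, then the standard error of the difference in means\n",
    "pooled_var = ((n_A - 1) * var_A + (n_B - 1) * var_B) / (n_A + n_B - 2)\n",
    "std_err = np.sqrt(pooled_var * (1 / n_A + 1 / n_B))\n",
    "\n",
    "t_manual = (A.mean() - B.mean()) / std_err\n",
    "t_manual"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's change things up so both A and B are just random, generated under the same parameters. So there's no \"real\" difference between the two:"
   ]
  },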
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Ttest_indResult(statistic=-0.5632806182026571, pvalue=0.5732501292213643)"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "B = np.random.normal(25.0, 5.0, 10000)\n",
    "\n",
    "stats.ttest_ind(A, B)"
   ]
  },
  {
   "cell_type": "raw",
   "metadata": {},
   "source": [
    "Now, our t-statistic is much lower and our p-value is really high. This supports the null hypothesis - that there is no real difference in behavior between these two sets.\n",
    "\n",
    "Does the sample size make a difference? Let's do the same thing - where the null hypothesis is accurate - but with 10X as many samples:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Ttest_indResult(statistic=0.7294273914972799, pvalue=0.4657411216867331)"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "A = np.random.normal(25.0, 5.0, 100000)\n",
    "B = np.random.normal(25.0, 5.0, 100000)\n",
    "\n",
    "stats.ttest_ind(A, B)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Our p-value actually got a little lower, and the t-test a little larger, but still not enough to declare a real difference. So, you could have reached the right decision with just 10,000 samples instead of 100,000. Even a million samples doesn't help, so if we were to keep running this A/B test for years, you'd never acheive the result you're hoping for:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Ttest_indResult(statistic=-0.9330159426646413, pvalue=0.350811849590674)"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "A = np.random.normal(25.0, 5.0, 1000000)\n",
    "B = np.random.normal(25.0, 5.0, 1000000)\n",
    "\n",
    "stats.ttest_ind(A, B)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If we compare the same set to itself, by definition we get a t-statistic of 0 and p-value of 1:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Ttest_indResult(statistic=0.0, pvalue=1.0)"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "stats.ttest_ind(A, A)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The threshold of significance on p-value is really just a judgment call. As everything is a matter of probabilities, you can never definitively say that an experiment's results are \"significant\". But you can use the t-test and p-value as a measure of signficance, and look at trends in these metrics as the experiment runs to see if there might be something real happening between the two."
   ]
  },
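  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To illustrate why the threshold is a judgment call, here's a minimal sketch that repeats the no-real-difference experiment many times and counts how often the p-value falls below a conventional 0.05 cutoff purely by chance. The cutoff, sample size, and number of repetitions here are arbitrary choices for illustration; with no real difference between the sets, you'd still expect roughly 5% of runs to look \"significant\":"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Repeat the no-real-difference experiment many times and count how often\n",
    "# the p-value crosses a conventional 0.05 threshold purely by chance.\n",
    "threshold = 0.05\n",
    "num_experiments = 1000\n",
    "false_positives = 0\n",
    "\n",
    "for _ in range(num_experiments):\n",
    "    A2 = np.random.normal(25.0, 5.0, 1000)\n",
    "    B2 = np.random.normal(25.0, 5.0, 1000)\n",
    "    if stats.ttest_ind(A2, B2).pvalue < threshold:\n",
    "        false_positives += 1\n",
    "\n",
    "# This fraction should hover around 0.05 (about 5% false positives)\n",
    "false_positives / num_experiments"
   ]
  },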
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Activity"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Experiment with more different distributions for A and B, and see the effect it has on the t-test."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}