Return-Path: tim.one@comcast.net
Delivery-Date: Sun Sep  8 08:18:49 2002
From: tim.one@comcast.net (Tim Peters)
Date: Sun, 08 Sep 2002 03:18:49 -0400
Subject: [Spambayes] test sets?
In-Reply-To: <200209080538.g885cjk17553@pcp02138704pcs.reston01.va.comcast.net>
Message-ID: <LNBBLJKPBEHFEDALKOLCGEOIBCAB.tim.one@comcast.net>

[Guido]
> I *meant* to say that they were 0.99 clues cancelled out by 0.01
> clues.  But that's wrong too!  It looks like I haven't grokked this part
> of
> your code yet; this one has way more than 16 clues, and it seems the
> classifier basically ended up counting way more 0.99 than 0.01 clues,
> and no others made it into the list.  I thought it was looking for
> clues with values in between; apparently it found none that weren't
> exactly 0.5?

There's a brief discussion of this before the definition of
MAX_DISCRIMINATORS.  All clues with prob MIN_SPAMPROB or MAX_SPAMPROB are
saved in the min and max lists respectively, and all other clues are fed
into the nbest heap.
Then the shorter of the min and max lists cancels out the same number of
clues in the longer list.  Whatever remains of the longer list (if anything)
is then fed into the nbest heap too, but no more than MAX_DISCRIMINATORS of
them.  In no case do more than MAX_DISCRIMINATORS clues enter into the final
probability calculation, but all of the min and max lists go into the list
of clues (else you'd have no clue that massive cancellation was occurring;
and massive cancellation may yet turn out to be a hook to signal that manual
review is needed).  In your specific case, the excess of clues in the longer
MAX_SPAMPROB list pushed everything else out of the nbest heap, and that's
why you didn't see anything other than 0.01 and 0.99.
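
For concreteness, here's a minimal sketch of that selection step.  It is
not the actual classifier code -- the names and the heapq.nlargest()
shortcut are just illustrative -- but it shows how a large excess of
extreme clues crowds everything else out of the final calculation:

    import heapq

    MAX_DISCRIMINATORS = 16
    MIN_SPAMPROB = 0.01
    MAX_SPAMPROB = 0.99

    def select_clues(clues):
        # clues is a list of (prob, word) pairs.
        mins   = [c for c in clues if c[0] == MIN_SPAMPROB]
        maxs   = [c for c in clues if c[0] == MAX_SPAMPROB]
        others = [c for c in clues
                  if MIN_SPAMPROB < c[0] < MAX_SPAMPROB]

        # The shorter extreme list cancels an equal number of clues
        # from the longer one; only the excess survives.
        ncancel = min(len(mins), len(maxs))
        survivors = mins[ncancel:] + maxs[ncancel:]

        # The surviving extremes compete with everything else for the
        # MAX_DISCRIMINATORS slots; being farthest from 0.5, a big
        # enough excess pushes all other clues out.
        return heapq.nlargest(MAX_DISCRIMINATORS,
                              others + survivors,
                              key=lambda c: abs(c[0] - 0.5))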

Before adding these special lists, the outcome when faced with many 0.01 and
0.99 clues was too often a coin toss:  whichever flavor just happened to
appear MAX_DISCRIMINATORS//2 + 1 times first determined the final outcome.

>> That sure sets the record for longest list of cancelling extreme clues!

> This happened to be the longest one, but there were quite a few
> similar ones.

I just beat it <wink>:  a tokenization scheme that folds case, and ignores
punctuation, and strips a trailing 's' from words, and saves both word
bigrams and word unigrams, turned up a low-probability very long spam with a
list of 410 0.01 clues and 125 0.99 clues!  Yikes.

> I wonder if there's anything we can learn from looking at the clues and
> the HTML.  It was heavily marked-up HTML, with ads in the sidebar, but
> the body text was a serious discussion of "OO and soft coding" with lots
> of highly technical words as clues (including Zope and ZEO).

No matter how often it says Zope, it gets only one 0.01 clue from doing so.
Ditto for ZEO.  In contrast, HTML markup has many unique "words" that get
0.99.  BTW, this is a clear case where the assumption of
conditionally-independent word probabilities is utterly bogus -- e.g., the
probability that "<body>" appears in a message is highly correlated with the
prob of "<br>" appearing.  By treating them as independent, naive Bayes
grossly misjudges the probability that both appear, and the only thing you
get in return is something that can actually be computed <wink>.
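
Here's the flavor of the misjudgment, with made-up numbers purely for
illustration:

    # Suppose "<body>" and "<br>" each appear in 40% of all messages,
    # and essentially always appear together.
    p_body = 0.40
    p_br   = 0.40
    p_both_actual = 0.39          # they co-occur almost every time

    # Naive Bayes multiplies as if they were independent:
    p_both_naive = p_body * p_br  # 0.16 -- understates reality ~2.4x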

Read the "What about HTML?" section in tokenizer.py.  From the very start,
I've been investigating what would work best for the mailing lists hosted at
python.org, and HTML decorations have so far been too strong a clue to
justify ignoring them in that specific context.  I haven't done anything
geared toward personal email, including the case of non-mailing-list email
that happens to go through python.org.

I'd prefer to strip HTML tags from everything, but last time I tried that it
still had bad effects on the error rates in my corpora (the full test
results with and without HTML tag stripping are included in the "What about
HTML?" comment block).  But as the comment block also says,

# XXX So, if another way is found to slash the f-n rate, the decision here
# XXX not to strip HTML from HTML-only msgs should be revisited.

and we've since done several things that gave significant f-n rate
reductions.  I should test that again now.

> Are there any minable-but-unmined header lines in your corpus left?

Almost all of them -- apart from MIME decorations that appear in both
headers and bodies (like Content-Type), the *only* header lines the base
tokenizer looks at now are Subject, From, X-Mailer, and Organization.
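
If it helps to picture that, one way to mine a header is to tag each word
with the header's name, so the same word in different headers stays a
distinct clue.  A rough sketch (not the tokenizer's actual code; `msg` is
assumed to be an email.message.Message):

    import re

    word_re = re.compile(r"[\w$\-\x80-\xff]+")

    def tokenize_headers(msg):
        # Illustrative only:  produce "header:word" tokens for the
        # handful of headers the base tokenizer currently examines.
        for name in ('subject', 'from', 'x-mailer', 'organization'):
            for value in msg.get_all(name, []):
                for w in word_re.findall(value.lower()):
                    yield "%s:%s" % (name, w)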

> Or do we have to start with a different corpus before we can make
> progress there?

I would need different data, yes.  My ham is too polluted with Mailman
header decorations (which I may or may not be able to clean out, but fudging
the data is a Mortal Sin and I haven't changed a byte so far), and my spam
too polluted with header clues about the fellow who collected it.  In
particular I have to skip To and Received headers now, and I suspect they're
going to be very valuable in real life (for example, I don't even catch
"undisclosed recipients" in the To header now!).

> ...
> No, sorry.  These were all of the following structure:
>
>   multipart/mixed
>       text/plain        (brief text plus URL(s))
>       text/html         (long HTML copied from website)

Ah!  That explains why the HTML tags didn't get stripped.  I'd again offer
to add an optional argument to tokenize() so that they'd get stripped here
too, but if it gets glossed over a third time that would feel too much like
a loss <wink>.

>> This seems confused: Jeremy didn't use my trained classifier pickle,
>> he trained his own classifier from scratch on his own corpora.
>> ...

> I think it's still corpus size.

I reported on tests I ran with random samples of 220 spams and 220 hams from
my corpus (that means training on sets of those sizes as well as predicting
on sets of those sizes), and while that did harm the error rates, the error
rates I saw were still much better than Jeremy reported when using 500 of
each.


Ah, a full test run just finished, on the

   tokenization scheme that folds case, and ignores punctuation, and strips
   a trailing 's' from words, and saves both word bigrams and word unigrams
This is the code:

            # Tokenize everything in the body.
            lastw = ''
            for w in word_re.findall(text):
                n = len(w)
                # Make sure this range matches in tokenize_word().
                if 3 <= n <= 12:
                    if w[-1] == 's':
                        w = w[:-1]
                    yield w
                    if lastw:
                        yield lastw + w
                    lastw = w + ' '

                elif n >= 3:
                    lastw = ''
                    for t in tokenize_word(w):
                        yield t

where

    word_re = re.compile(r"[\w$\-\x80-\xff]+")
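
To make that concrete, here's the same fragment rewrapped as a standalone
generator (illustrative only -- in the real code this sits inside the
tokenizer, case folding happens upstream, and oversized words are handed to
tokenize_word() instead of being dropped):

    import re

    word_re = re.compile(r"[\w$\-\x80-\xff]+")

    def unigrams_and_bigrams(text):
        lastw = ''
        for w in word_re.findall(text):
            n = len(w)
            if 3 <= n <= 12:
                if w[-1] == 's':
                    w = w[:-1]
                yield w
                if lastw:
                    yield lastw + w
                # lastw keeps a trailing space so bigrams read "w1 w2".
                lastw = w + ' '
            elif n >= 3:
                lastw = ''

    print(list(unigrams_and_bigrams("python programmers love generators")))
    # ['python', 'programmer', 'python programmer', 'love',
    #  'programmer love', 'generator', 'love generator']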

This at least doubled the process size over what's done now.  It helped the
f-n rate significantly, but probably hurt the f-p rate (the f-p rate is too
low with only 4000 hams per run to be confident about changes of such small
*absolute* magnitude -- 0.025% is a single message in the f-p table):

false positive percentages
    0.000  0.000  tied
    0.000  0.075  lost  +(was 0)
    0.050  0.125  lost  +150.00%
    0.025  0.000  won   -100.00%
    0.075  0.025  won    -66.67%
    0.000  0.050  lost  +(was 0)
    0.100  0.175  lost   +75.00%
    0.050  0.050  tied
    0.025  0.050  lost  +100.00%
    0.025  0.000  won   -100.00%
    0.050  0.125  lost  +150.00%
    0.050  0.025  won    -50.00%
    0.050  0.050  tied
    0.000  0.025  lost  +(was 0)
    0.000  0.025  lost  +(was 0)
    0.075  0.050  won    -33.33%
    0.025  0.050  lost  +100.00%
    0.000  0.000  tied
    0.025  0.100  lost  +300.00%
    0.050  0.150  lost  +200.00%

won   5 times
tied  4 times
lost 11 times

total unique fp went from 13 to 21

false negative percentages
    0.327  0.218  won    -33.33%
    0.400  0.218  won    -45.50%
    0.327  0.218  won    -33.33%
    0.691  0.691  tied
    0.545  0.327  won    -40.00%
    0.291  0.218  won    -25.09%
    0.218  0.291  lost   +33.49%
    0.654  0.473  won    -27.68%
    0.364  0.327  won    -10.16%
    0.291  0.182  won    -37.46%
    0.327  0.254  won    -22.32%
    0.691  0.509  won    -26.34%
    0.582  0.473  won    -18.73%
    0.291  0.255  won    -12.37%
    0.364  0.218  won    -40.11%
    0.436  0.327  won    -25.00%
    0.436  0.473  lost    +8.49%
    0.218  0.218  tied
    0.291  0.255  won    -12.37%
    0.254  0.364  lost   +43.31%

won  15 times
tied  2 times
lost  3 times

total unique fn went from 106 to 94