Return-Path: anthony@interlink.com.au
Delivery-Date: Thu Sep 12 05:26:41 2002
From: anthony@interlink.com.au (Anthony Baxter)
Date: Thu, 12 Sep 2002 14:26:41 +1000
Subject: [Spambayes] Current histograms 
Message-ID: <200209120426.g8C4QfI23085@localhost.localdomain>


> They weren't partitioned in any particular scheme - I think I'll write a
> reshuffler and move them all around, just in case (fwiw, I'm using MH 
> style folders with numbered files - means you can just use MH tools to 
> manipulate the sets.)

Freak show. Obviously there _was_ some sort of pattern to the data:

Training on Data/Ham/Set1 & Data/Spam/Set1 ... 1798 hams & 1546 spams
      0.779   0.582
      0.834   0.840
      0.945   0.452
      0.667   1.164
Training on Data/Ham/Set2 & Data/Spam/Set2 ... 1798 hams & 1547 spams
      1.112   0.776
      0.834   0.969
      0.779   0.646
      0.667   1.100
Training on Data/Ham/Set3 & Data/Spam/Set3 ... 1798 hams & 1548 spams
      1.168   0.582
      1.001   0.646
      0.834   0.582
      0.667   0.453
Training on Data/Ham/Set4 & Data/Spam/Set4 ... 1798 hams & 1547 spams
      0.779   0.712
      0.779   0.582
      0.556   0.840
      0.779   0.970
Training on Data/Ham/Set5 & Data/Spam/Set5 ... 1798 hams & 1546 spams
      0.612   0.517
      0.779   0.517
      0.723   0.711
      0.667   0.582
total false pos 144 1.60177975528
total false neg 101 1.30592190328

(before the shuffle, I was seeing:
total false pos 273 3.03501945525
total false neg 367 4.74282760403
)
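
For anyone who wants to try the same thing, the reshuffle can be sketched roughly as below. This is a minimal sketch, not the script actually used: the function name `reshuffle`, the `seed` parameter and the staging directory are all made up for illustration, and it assumes Python 3, a parent directory holding Set1..SetN, and MH-style message files whose names are all digits. Anything else in a folder (e.g. MH bookkeeping files) is left where it is.

```python
import os
import random
import shutil
import tempfile


def reshuffle(parent, nsets, seed=None):
    """Pool the numbered files under parent/Set1..SetN and deal them back
    out at random. Returns the number of files placed in each set.

    Hypothetical helper for illustration only.
    """
    set_dirs = [os.path.join(parent, "Set%d" % i) for i in range(1, nsets + 1)]

    # Move everything into a staging directory first, so that renumbering
    # can never overwrite a message that has not been moved yet.
    staging = tempfile.mkdtemp(dir=parent)
    pooled = []
    for d in set_dirs:
        os.makedirs(d, exist_ok=True)
        for name in sorted(os.listdir(d)):
            src = os.path.join(d, name)
            # Only numbered regular files count as messages.
            if name.isdigit() and os.path.isfile(src):
                dst = os.path.join(staging, str(len(pooled) + 1))
                shutil.move(src, dst)
                pooled.append(dst)

    # Shuffle the pool, then deal round-robin so set sizes differ by at most 1.
    random.Random(seed).shuffle(pooled)
    counts = [0] * nsets
    for i, src in enumerate(pooled):
        k = i % nsets
        counts[k] += 1
        shutil.move(src, os.path.join(set_dirs[k], str(counts[k])))

    os.rmdir(staging)  # empty by now
    return counts
```

Usage would be something like `reshuffle("Data/Ham", 5)` followed by `reshuffle("Data/Spam", 5)`. Dealing round-robin after the shuffle keeps the sets the same size to within one message, which matches the near-equal counts in the training lines above.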

For the sake of comparison, here's what I see with the data partitioned into 2 sets:

Training on Data/Ham/Set1 & Data/Spam/Set1 ... 4492 hams & 3872 spams
      0.490   0.776
Training on Data/Ham/Set2 & Data/Spam/Set2 ... 4493 hams & 3868 spams
      0.401   0.491
total false pos 40 0.445186421814
total false neg 49 0.633074935401
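
A note on reading the "total" lines: judging purely from the numbers above, the percentage looks like the error count divided by the number of distinct messages across all the sets, times 100. That is an inference from the output, not something checked against the test driver, and `error_rate` below is just a made-up helper to show the arithmetic:

```python
def error_rate(errors, set_sizes):
    """Errors as a percentage of all messages across the sets.

    Assumed formula, inferred from the totals printed in the runs above.
    """
    return 100.0 * errors / sum(set_sizes)


# 5-way split: set sizes taken from the "Training on ..." lines.
five_way_fp = error_rate(144, [1798] * 5)
five_way_fn = error_rate(101, [1546, 1547, 1548, 1547, 1546])

# 2-way split.
two_way_fp = error_rate(40, [4492, 4493])
two_way_fn = error_rate(49, [3872, 3868])

print(five_way_fp, five_way_fn)
print(two_way_fp, two_way_fn)
```

One consequence of dividing by distinct messages: in the 5-way run each message gets scored by four different classifiers, so the totals there accumulate errors over four passes, while in the 2-way run each message is scored only once.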

more later...

Anthony