How to Test DSPAM

This is an attempt to help anyone looking to test DSPAM, although many of these pointers could be applied to any type of filter. After looking at many filter test attempts, I've found a lot of question marks, the biggest being the contrast between what the tests have shown and what many real-world users have reported. Because many of these tests are closed, there is no real way to evaluate their credibility. I have, however, managed to compile a growing list of common mistakes that, based on what I've read, are likely to have been made in at least some of the tests I've seen today.

It's my hope that this will be taken as an attempt to help improve the testing process, rather than an attempt to discredit any particular test. While I believe there are some bad tests out there which are less than credible, I see others getting closer to being on the right track. Hopefully at least some testers will step up and make their frameworks open for evaluation which will, of course, make improving them much easier.

Testing isn't really the right terminology when approaching adaptive filters. Since the output of a filter depends on the sum of all of its previous inputs, it's more appropriate to set your goal as establishing a true _simulation_ rather than a test, and taking the metrics you need during the simulation. Testing is generally called testing because it's a test - and not the real McCoy. I think you'll find that the closer you get to performing a true simulation, the more your results will begin to line up with what each filter's users are actually seeing.

Just to give you an example of the difference, DSPAM's pre-production tests for v3.4 showed a mere 13 spam misses and 0 false positives out of a corpus of over 5,000 messages. This was done with the SpamAssassin public corpus from 2003 and part of 2002.
The levels of accuracy I receive on my own mail, however, are much higher - I rarely see a spam for months at a time. This is because testing with the SA corpus isn't nearly as real-world as testing with a real corpus of mail.

If you're interested in performing a credible test including DSPAM, please feel free to contact me at jonathan@nuclearelephant.com. I look forward to helping to make the test a successful one.

Common testing mistakes (in no particular order):

Mistake 1: Testing without turning off the "Statistical Sedation" feature

DSPAM has a feature called statistical sedation, a tunable parameter the systems administrator can set in dspam.conf that allows DSPAM to water down its catch rate in an attempt to be completely paranoid about false positives. This watering down occurs until 2500 innocent messages have been trained (e.g. the user's "training loop"). The default level is set to mid-level sedation. The feature defaults to "on" for two reasons.

First, it's done to cater to service providers (the majority of DSPAM users) whose primary goal is to avoid as many false positives as possible, even if it means missing more spam. False positives cost money - customer service calls to complain, lost customers, and the cost to end-users of missing a job offer or a contract in their mailbox. There is far less liability in missing spam than in missing real mail, and service providers like avoiding liability.

Second, this feature is designed for users who are training from an EMPTY CORPUS (which most DSPAM users are). The way statistical filtering works is that words appearing only in spam are considered extremely evil. So if you get a lot of spam, you need the filter to wait until you get enough nonspam to balance the vocabulary.

When testing, however, you're not acting as a paranoid service provider and you're not starting with an empty corpus. A corpus is provided and pre-trained on X messages, so a vocabulary is already built and balanced.
On top of this, testers want true results rather than paranoid results that miss spam. Unfortunately, some testers fail to turn off this feature, which is why you see DSPAM always perform excellently on false positives in tests, but not catch much spam. To run a true test, sedation should be disabled prior to testing to prevent DSPAM from waiting for more ham before filtering. This can be done by changing this line in dspam.conf:

    Feature tb=4

to the following:

    Feature tb=0

Or, of course, by simply training 2500 nonspams prior to performing the evaluation. Most testers think 1000-2000 pre-trained messages should be enough to conduct a test, and I agree - however, my users want to be able to start from scratch, and that requires a certain level of additional safeguarding. Once the safeguarding is turned off, one should notice a definite improvement in filtering ability when starting with a pre-trained corpus. If you feel a civic duty to train with sedation, consider training without it as well and including both findings in your results.

Mistake 2: Training with a slow storage driver

DSPAM supports many different types of storage backends. Many of these backends (such as Berkeley DB and SQLite) are considered 'residential' drivers for an individual user. They are significantly slower than more high-end drivers such as the MySQL driver and may even face file corruption issues under heavy loads. In order to adequately test DSPAM's true processor consumption, it's strongly recommended that a storage driver with little overhead be used.

The recommended production storage driver for multiuser or testing environments is MySQL. This is the recommended driver for two reasons: 1. It has the least overhead (so you can measure DSPAM, and not the backend), and 2.
It is the most stable backend and least likely to result in skewed tests in the event of data corruption - especially when performing testing at the high volumes which most "residential" storage drivers aren't designed to handle.

Furthermore, Berkeley DB is considered "deprecated", SQLite is considered "Beta", and the Oracle driver is completely unsupported at the moment. This leaves PostgreSQL, which we are working on - it is considerably slower, but is our only other fully supported and recommended production driver. We still recommend people use MySQL unless religious convictions make this impossible.

Mistake 3: Providing the original sample in the feedback loop

DSPAM's error correction process uses, by default, a signature embedded into each email. This signature is a "serial number" providing a reference to the original training data stored (temporarily) in the database, and it ensures that exactly the same data is relearned properly. Without this signature, DSPAM v3.x will intentionally fail to learn, which is designed to throw up a red flag for systems administrators. Previous 2.x versions ambiguously attempted to "do the best they could", which included ignoring the forwarding user's headers and learning parts of the forwarded body. This ended up creating a huge mess when the post-processed signature wasn't passed in (which is why we changed the behavior in 3.x to just break on impact).

Regardless of which version you use, it is critical that any error correction be done using the _outputted_ message with DSPAM's added headers and _not_ the original text from the corpus. This was one of the major flaws that appeared during Cormack & Lynam's tests, which had to be corrected before the tests could continue. DSPAM _can_ be configured to train with pristine messages, but this _requires_ some configuration changes, and if you want to ensure that your testing is accurate it is highly discouraged.
If you are uncertain whether messages are being retrained correctly, you can turn on debugging and check for errors such as "signature not found" or "failed to load signature". If you're running v3.x, the SM counter should increase when you run dspam_stats. This signifies that the message was retrained properly. If you're running 2.x, the counter will still increase, but that is not a guarantee that the filter is learning properly - you'll have to check the debug output.

Mistake 4: Using a short-term corpus snapshot

Some have made the error of using a two or four-week snapshot of their email as an evaluation medium without ever realizing the importance of the time period over which the data has been collected. It's very easy to collect 10,000 copies of the same spam, and when this happens, the filter learns only the contextual characteristics of that one type of spam. A few weeks is only long enough for the filter to see the same spam over and over, then one or two messages of entirely new spam (which isn't enough to train on). Two weeks just isn't long enough to capture all of the different permutations of spam, and therefore tests based on a corpus taken over only a short period are likely to result in significant errors when they come across other types of spam.

The old adage that quality is more important than quantity holds true here. Not only do filters need to see certain patterns in spam, but they need to see each pattern at least 5 or 6 times before the filter will even consider it good data to work with. A good testing corpus should be taken over a period of 4-6 months.

Mistake 5: Using a noncontiguous corpus

Some tests have made the mistake of trying to test statistical filters in much the same way as heuristic filters are tested - by throwing thousands of random emails at the filter. This naturally produces very poor results because of the nature of the filter.
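
The corpus requirements above (a collection period of several months, with mail processed in its original order) can be checked mechanically before a run. The following is only a rough sketch: the function names and the 120-day threshold (roughly the low end of the 4-6 month recommendation) are assumptions made for illustration, and it relies on each message's Date header being parseable and carrying a timezone.

```python
from datetime import timedelta
from email.utils import parsedate_to_datetime


def parse_dates(date_headers):
    """Parse RFC 2822 Date header strings into datetime objects."""
    return [parsedate_to_datetime(h) for h in date_headers]


def spans_enough_time(date_headers, min_days=120):
    """True if the corpus covers at least min_days between its
    oldest and newest message (120 is an assumed threshold)."""
    dates = parse_dates(date_headers)
    return max(dates) - min(dates) >= timedelta(days=min_days)


def is_in_delivery_order(date_headers):
    """True if the messages are listed oldest-first, i.e. the corpus
    has not been shuffled before being fed to the filter."""
    dates = parse_dates(date_headers)
    return all(a <= b for a, b in zip(dates, dates[1:]))
```

Date headers in spam are frequently wrong or forged, so a real check would probably be better off using delivery timestamps where they are available; the idea is simply to reject a corpus that is too short or out of order before spending time on it.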
Statistical filters, over a period of time, have the innate ability to detect unique characteristics of the email that familiar people send. This can include tokens from the Received and From headers, or even things like the HTML constructs used by a particular friend's email client. It's very important for any testing corpus to be not only contiguous in time period, but also contiguous in thread. If you realize that a statistical filter is one big sequential circuit, this makes perfect sense: the filter's output is a sum of all of its previous inputs.

From a lexical perspective, if you break the corpus then you break the continuity of data. Let's say that you train messages A-H. If you processed messages I, J, and K afterwards, they might be classified correctly but only with marginal confidence. I, J, and K will have contained some additional new diction that will be present in L, M, N, and O as the conversation evolves. So if you fail to train I, J, and K, which would've just made it in by the skin of their teeth, then L, M, N, and O end up being misclassified because the data needed to push the confidence level over the edge is missing. It's quite a simple concept really - while the result of a Bayesian filter is a 1 or a 0, a single data point could easily make the difference between the two. You can't justifiably classify message L unless A-K have been adequately trained.

Mistake 6: Using only a single corpus of email

Running tests on a single corpus of email only shows how well a filter performed on that person's email. There's no way to tell (without the actual corpus) whether the user had abnormalities in their email, changed their behavior frequently, etcetera. Any credible test should use a minimum of 3 or 4 different corpora, the more the better. Depending on what you're trying to prove, you may want each corpus to be from not only a different individual, but a different individual in a different social class - e.g.
teenagers, nerds, college students, housewives, etcetera. Even medical tests, which zero in on one particular group of people, test in quantity within that group. If your test is to prove the accuracy for typical nerds, use three or four different nerds' email. If you're trying to prove accuracy for the entire world, use many different classes and many corpora per class.

Mistake 7: Using an old or development version of the software

When testing, you should always use the latest stable version of the software available. I've seen some tests for DSPAM lately which date back to versions more than a year and a major revision old. When testing DSPAM, you should use at least the current major revision, but the latest stable is best. 3.x has been around for 7 months. Always use the latest patch version available as well. Never use odd-version numbered DSPAM releases, which are appropriately labeled development releases.

What's the point of testing a year-old version of some software, unless you're trying to make the point that a year ago, this software performed this way? The test is already useless to most people, and misleading as well. If one _does_ have adequate justification to test old versions, they should state this, and they should also ensure that every filter's version comes from the same cutoff.

Mistake 8: Using a "stock" configuration, without testing a couple of alternatives

Users have the ability to tweak many different options in a filter. These include training mode, tokenizer mode (always use chained when testing DSPAM), and other features. Try a few different configurations and aim to identify what would work best if this user were using the software in real life. No sane user would use a set of features that didn't suit them, and any sane systems administrator would tweak a stock config to suit their needs.
Bottom line: do you want to test a filter's accuracy, or do you want to test the filter author's skills in reading the minds of sysadmins well enough to write a default config?

Mistake 9: Not asking for assistance and validation from the filters' users

Before posting any results, make them known to the users of each filter you're testing and ask for opinions, peer review, and thoughts on how to make the test better. Nearly all of the tests running in the wild today take place without any input from people who are actually using these filters, and are set up after only reading through a small part of the documentation. That's not to say that end-users should dictate how you test, but they are likely to know more about the caveats of testing than someone new to the software. A true, unbiased tester is more interested in ensuring that their tests are performed properly than in making the results fit their expectations.

Mistake 10: Comparing apples to oranges

Using the same corpus doesn't necessarily mean your comparisons are sane. Comparing a heuristic filter to a statistical filter doesn't make sense, as they're designed for very different functions (heuristic filters are pre-programmed to filter _today's_ spam, while statistical filters are designed to filter and adapt). Any heuristic filter has therefore been hard-coded to provide a set of responses to the specific data you're presenting it with.

If you're going to compare a statistical filter to a heuristic filter, you should use a version of the heuristic filter's rulesets published at the time that the corpus originally began (which should be at least 6 months old) to measure the true capabilities of the filter, rather than how well the programmer singled out emails he knew the filter was going to see that month. A heuristic filter has been written to respond to the emails of the time - it's like studying for a test by peeking at the questions.
That's fine if you want to test it with a 6-month corpus beginning the day after the rulesets were committed, as the mail will be new. But if you test with the spam that was available _when_ the tests were written, all you're testing is the ability of the filter author to code around your test.

Mistake 11: Leaving your test framework closed

It doesn't really matter what your tests prove if there's no way to evaluate just how accurate your tests are. Only open tests, where the testing framework (including source code) is open to evaluation, are truly credible. If you must keep your corpora private, at least make the sources available so others can run them and validate (or help improve) your results.

What you set out to do in method doesn't always align with the source code. It's critical that the community evaluate not only the framework but the actual implementation of it in code form. This isn't a concern in other types of testing (such as medical testing) because people trust that you do what you say you do. Code, however, might _not_ do what you tell it to do, and someone should verify this.

Mistake 12: Conducting a test with bias

You can prove just about anything you set out to if you put your mind to it. An individual conducting a test should never include their favorite filter(s), or filters they are emotional about (e.g. from some bumping of heads or other issues), in their tests. If a single filter was used to classify your corpus, that filter should be eliminated from the test for the sole reason that it's obviously going to classify the mail as it did the first time - even if it's wrong about it.

Mistake 13: Not R'ing TFM

Many mistakes are made by simply skimming through the documentation in an attempt to streamline the testing process. It's critical that the testers understand the different tunable parameters, warnings, and notes about each filter before testing it. Please take the time to read the docs.

DSPAM-Specific Known Testing Anomalies:

1.
Bounce Spams

Version 2.x had a bug/feature (we're still debating whether this was intentional or not) that caused it to ignore bounce messages. This behavior has been corrected in 3.x, so that all bounces are evaluated before being delivered. If you're wondering why your test showed a failure to detect bounce spams, it is most likely due to this. I recommend testing against 3.x for better results.

2. Sedation Defaults

Version 3.x, up to 3.2.6, had a bug causing the statistical sedation level of '5' (the default) to be set (even if overridden in dspam.conf) if the --feature flag was used but didn't reference the sedation level. To be certain that sedation is turned off, include tb=0 in your feature command line (be sure to include all the other features you want as well, such as 'ch' and 'wh').
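
To close, here is a rough sketch of what "simulation rather than test" can look like in code. It is an illustration only: ToyFilter is a hypothetical stand-in that counts tokens and has nothing to do with DSPAM's actual algorithm, and the train-on-error loop is just one assumed training mode. With a real filter, the correction step would hand back the filter's own outputted message (with its signature), as described under Mistake 3, rather than the corpus text.

```python
from collections import Counter


class ToyFilter:
    """Hypothetical stand-in for an adaptive filter (NOT DSPAM)."""

    def __init__(self):
        self.spam = Counter()  # token -> number of spams it appeared in
        self.ham = Counter()   # token -> number of hams it appeared in

    def classify(self, text):
        """Return True for spam, based on everything learned so far."""
        tokens = set(text.lower().split())
        score = sum(self.spam[t] - self.ham[t] for t in tokens)
        return score > 0

    def train(self, text, is_spam):
        """Learn the message's tokens under the correct class."""
        target = self.spam if is_spam else self.ham
        target.update(set(text.lower().split()))


def simulate(messages, filt):
    """Feed (text, is_spam) pairs to the filter in their original
    delivery order, correcting each error as a real user would."""
    stats = {"total": 0, "missed_spam": 0, "false_positives": 0}
    for text, is_spam in messages:
        verdict = filt.classify(text)
        stats["total"] += 1
        if verdict != is_spam:
            key = "missed_spam" if is_spam else "false_positives"
            stats[key] += 1
            # Retrain before the next message arrives, so that later
            # verdicts depend on all previous inputs.
            filt.train(text, is_spam)
    return stats
```

The point of the structure is that metrics are taken while the mail flows through in order, so each verdict reflects everything the filter has seen and been corrected on up to that moment.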