- Get the sample files
- View Java code:
- View Python code:
Example 1: Spam Recognition
This example illustrates how Winnow was used for spam email filtering in the early 2000's as part of the "CRM114 classifier". Text is retrieved from email, tokenized, and then rendered into special data structures called "features". Winnow is then used to classify the features as spam or non-spam. The interesting thing to note about these feature structures is how they are used to put words into a useful kind of context.
The program design works like this: To extract the features from a file, all the contents are read, character for character, into a buffer, then a regular expression is used to tokenize it all into individual words. Then words are taken in groups of five (or whatever you set FEATURE_LENGTH to), and arranged according to orthogonally sparse bigrams (OSB) to keep in the database. The database is essentially all the input nodes to the Winnow network.
With each successive text file read, the features extracted from it are compared to those already existing in the network. Any matching features are considered "active features", and they in turn have their weights adjusted.
It's important to note that using the right data structures for a Winnow implementation mean everything. Dictionary type data structures can be searched quickly and added to efficiently. Manually searching through static arrays for matching features will slow the algorithm down to a crawl.
The sample email files come with file extensions like ".spam" and ".good" for use in supervised learning. I've made a collection of these files out of various email messages and spam of mine and others where I work (it seems I have an endless supply of spam to use for these tests!). One thing you'll note about these email samples: they don't have any header information (normally, all email has header data attached to it like sender, addressee, protocol, servers, dates, times, etc.). I know that any useful spam filter would be particularly aimed at header data. However, one of my ultimate interests in artificial intelligence is natural language processing, so I chose to stick with the human language portion of the emails only.
NOTE: The SAMPLES_PATH constant will need to be adjusted to the samples directory on your machine.
Click here to download the sample files.
- Step 1: Collect a list of sample files and randomize their order.
- Step 2: For each file:
- Step 3: Retrieve text from file.
- Step 4: Tokenize the text (i.e.; split the text into its individual words).
- Step 5: Determine active features by comparing each token to features stored in the database.
- If a token matches a previously stored feature, it is active.
- If not a match, it is set aside.
- Step 6: For each active feature:
- Step 7: Predict if it is "spam" or "good".
- Step 8a: If prediction is "spam" but the filename indicates "good":
- Increase the active feature's "good" weight.
- decrease the active feature's "spam" weight.
- Step 8b: If prediction is "good", but the filename indicates "spam":
- Increase the active feature's "spam" weight.
- decrease the active feature's "good" weight.
- Step 9: Apply thickening to weight values:
- If the active feature's weight values are too close, widen the gap.
- Step 10: Finish step 5:
- For all tokens which did not match previously stored features:
- Create new features from tokens and store them in the database.
- For all tokens which did not match previously stored features:
Feature selection, in terms of spam filtering, involves sliding a window of a given length over tokenized text. Remember that making tokens means breaking down streams of characters into what we would perceive as words, names, numeric references, etc. Siefkes et al. (2004) experimented with various window sizes and determined that a five-token window works best.
Figure 3. An example of a five-token window. The window is moved from left to right, one token at a time. The next position of the window in this example will include the word "only" and lose the word "offer".
Most of the articles about Winnow and spam filtering mention regular expressions as the primary means of rendering text streams into tokens. The table below is based on the regex types used by Siefkes et al (2004).
|Plain Tokenization (everything except whitespace and control characters):|
|CRM114 (whitespace, control and alphanumeric characters):|
|Simplified CRM114 (excludes periods, commas and colons):|
|Markup Language Sensitive (HTML):|
Unigrams, Bigrams, Polygrams...
Here, feature selection means more than single words (unigrams), it means phrases (in whole or in part). Unigram searches tend to generate too many false positives when particular words can be used in either acceptable or unacceptable contexts. For example, the word, "offer", could be found in unwanted advertisement email (i.e.; "This is a limited time offer! So you better hurry!"). However, the same word, "offer", is used in many other contexts found in expected email (i.e.; "I'd offer you my cell phone, but I'm expecting a call from...", or, "If you wanna go fishing this weekend, my offer still stands."). This is why bigrams (two words) and various other polygram schemes were implemented; word patterns indicate at least some level of context.
Orthogonally Sparse Bigrams (OSB)
OSB seems to be the most popular feature structure because it avoids redundancy and generally improves the performance of many spam filtering algorithms. Without going into any mathematical description, I'll just say they call it "orthogonal" because of the way each successive feature pattern is created without duplicating any of the previous patterns. The focus of each feature is one word (i.e.; the last one), and the location of other words in the pattern provides a sense of context.
Notice also the "<skip>" symbols, these are very important to the structure of a feature. Think of them as wildcard placeholders where any word or token can fit. The filtering algorithm can use this to compare many different combinations. For instance, if we look at the last line in Figure 4a, "is <skip> <skip> <skip> offer", the filter algorithm could use this to detect, "is a limited time offer", or "is a very special offer", "is our very best offer", "is a VIP only offer", as well as many other similar combinations.
|Orthogonally Sparse Bigrams|
Figure 4a. An example of the OSB features extracted from a sliding window: "is a limited time offer".
Figure 4b. I couldn't find any description of what to do with the empty feature elements in Figure 4a. Many authors use this same or similar graphs to describe OSB, but no one said anything about the empty elements. So when it came time to implement OSB feature structure in code, I simply considered the empty elements as "<skip>".
Sparse Binary Polynomial Hashing (SBPH)
I'm mentioning SBPH here for comparison to OSB. I haven't tried it yet. SBPH was what they used before OSB took the industry by storm. They called it "sparse" because of the <skip> characters, and the "binary polynomial hashing" had to do with various methods of optimized coding and storing of features. "Hashing" also offered some level of security.
Note how many features are created with SBPH. There are no duplicates, but researchers found that the shear size of the feature base collected by an algorithm hindered rather than helped.
|Sparse Binary Polynomial Hashing|
Example Results 1
The output is lengthy, so I've abreviated it here. And since the files are chosen randomly, the results can vary.
Note that "Actual = ..., but Expected = ..." indicate an incorrect prediction for the corresponding input file.
In the beginning, virtually all predictions are wrong, but toward the end, fewer errors occur.
processing: good1.good Actual = good 'Good' score = 0.0 'Spam' score = 0.0 Actual = 'good', but Expected = 'spam'. processing: good42.good Actual = good 'Good' score = 3.0 'Spam' score = 3.0 Actual = 'good', but Expected = 'spam'. processing: spam47.spam Actual = spam 'Good' score = 2.0 'Spam' score = 2.0 Actual = 'spam', but Expected = 'good'. --- results abbreviated --- processing: good15.good Actual = good 'Good' score = 105.73050013287809 'Spam' score = 115.03541586758665 Actual = 'good', but Expected = 'spam'. processing: spam94.spam Actual = spam 'Good' score = 86.69176474584701 'Spam' score = 115.11175887918309 processing: spam95.spam Actual = spam 'Good' score = 729.9708280015628 'Spam' score = 747.2345231422602 processing: spam108.spam Actual = spam 'Good' score = 147.7729642349633 'Spam' score = 172.88913265295966 processing: spam97.spam Actual = spam 'Good' score = 54.66697534857083 'Spam' score = 84.25032119166698 processing: good4.good Actual = good 'Good' score = 0.0 'Spam' score = 0.0 Actual = 'good', but Expected = 'spam'. processing: spam104.spam Actual = spam 'Good' score = 197.39849701487185 'Spam' score = 212.39728041348977 Features in database: 102504 Total features created: 102504
Example Results 2: Error Tracking
The code examples with extra error tracking functions can indicate just how well the training works. The results appear the same except for the list of numbers toward the end.
Average Error frequency for every 10 samples... At the beginning: 0.95 0.95 0.95 0.8833333333333334 0.8833333333333332 0.8833333333333332 0.9333333333333332 0.9333333333333332 0.8666666666666666 0.8666666666666668 0.8166666666666668 0.8166666666666667 0.8166666666666667 0.8333333333333333 0.7666666666666667 0.7166666666666667 0.6499999999999999 0.5999999999999999 0.6166666666666666 0.5666666666666667 0.5366666666666667 0.4616666666666666 0.41166666666666674 0.4616666666666667 0.5283333333333333 0.5783333333333334 0.5700000000000001 0.62 0.67 0.6366666666666667 0.6366666666666667 0.6366666666666667 0.600952380952381 0.600952380952381 0.6009523809523809 0.6009523809523809 0.6009523809523809 0.525952380952381 0.525952380952381 0.5426190476190477 0.5559523809523809 0.5420634920634919 0.5402777777777776 0.4545634920634921 0.3795634920634921 0.37956349206349205 0.37456349206349204 0.37456349206349204 0.3078968253968254 0.32456349206349205 0.3245634920634921 0.3217857142857143 0.3203968253968254 0.3227777777777778 0.3977777777777778 0.3477777777777778 0.37777777777777777 0.36527777777777776 After final sample.
If these are plotted on a chart, we'd see this:
So the frequency of errors decreases as the algorithm continues learning.
Assis, F., Yerazunis, W., Siefkes, C., and Chhabra, S. (2005). "CRM114 versus Mr. X: CRM114 Notes for the TREC 2005 Spam Track." NIST Special Publication: SP 500-266; The Fourteenth Text REtrieval Conference (TREC 2005) Proceedings. PDF.
Littlestone, N. (1988). "Learning Quickly When Irrelevant Attributes Abound: A New Linear-threshold Algorithm". Machine Learning vol. 2. pp. 285-318.
Siefkes, C., Assis, F., Chhabra, S. & Yerazunis, W. S. (2004). "Combining Winnow and Orthogonal Sparse Bigrams for Incremental Spam Filtering." J.-F. Boulicaut et al. (Eds.): PKDD 2004, LNAI 3202. Springer-Verlag Berlin Heidelberg. pp. 410-421.
Related sites: Spam Filtering and Text Classification by Christian Siefkes.