Why GerH builds have a 'markovian.alt' and other '.alt' classifiers: those are R&D classifiers using the latest VT technology and classifier optimizations described here.
Note that this space shows the state of the art as of 7 February 2009. Any source code in this article shows the state as of that date (or earlier); its value is purely educational. Fetch a GerH build if you'd like to have this stuff in a usable form.
Markovian does something 'extra' to the features, while classifying.
First, let's have a look at the common denominator, function/code-wise, of Markovian, OSB, OSBF and Winnow.
That one's rather easy. You take the input, chop it up into features (i.e. word [sequence] hashes), and for each feature you create a hash index (feature hash value modulo CSS database size) and +1 the weight in that spot if you train, OR -1 the weight if you 'refute', i.e. 'unlearn'.
That way, features in existing [trained] input are registered in the CSS database, together with those features which occur in more than one message. (Or, more precisely at the current stage of development, since OSB et al don't support <unique> for training: those features which occur more than once, period. Hyperspace has <unique>, which makes sure each feature can only occur _once_ in each input: duplicate features are removed there. Say you hash per 1 word; then each 'in' in this email would hash to the same feature hash, and <unique> would tell crm114 not to bother about 'in' occurring multiple times within this email; it is counted only once. But then, OSB et al don't have <unique> for training. No worries. It's just an option.)
Which leads us to what 'features', in the mind of crm114, really are. Are they 'words'? Or what some would call 'tokens'?
Not precisely. Feature hashes are 'token based', which means it takes at least one token to produce one feature hash. Note the 'at least' there.
A 'token' is one 'token regex matching string' extracted from the input.
The default 'token regex' looks for words in whitespace-delimited input, but you can have custom regexes (in the // args for the train command) and then there's <bychar>, which replaces the default regex with '.', which matches a single character only, whitespace or not. For all of these, crm114 doesn't really care: it takes the longest string matching the token regex and declares that to be 1 (say: 'one') token. When that token has been munched, the process repeats with the remainder of the input, until EndOfInput is reached.
Then, what about that 'at least one' in the conversion from regex match, a.k.a. 'token', to 'feature hash'?
Pure, old, uncut, silicone-free Bayes would indeed imply one feature hash per token. Using the defaults, that's one word producing one feature hash each.
crm114 acts like this when you specify the
<unigram> option: this option instructs crm114 to take one token, create one hash and forget about all the other fancy stuff, like markov chains, etc.
Default OSB/OSBF/Winnow/Markovian in crm114 are a bit more sophisticated when it comes to producing feature hashes. Of course, depending on your classifier, there's that feature which represents one word with a feature hash (unigram), but generally crm114's 'bayesian' classifiers act rather more 'markovian': by default, all classifiers pick not just one token, but two (or, in the case of VT-ed classifiers: multiple) tokens which are close together, and produce one hash for that. Additional hashes are then produced for different token sequence picks.
Let's clarify this a bit with an example. Assume your input is words, here represented by numbers: 1 2 3 4 5 6
That's 6 tokens.
What crm114 does is assume a certain 'pipeline depth', prefixing the input with 'deadbeef' tokens D so the code doesn't suffer from edge conditions at startup. Given the usual pipeline depth of 5, the token input thus looks like
Da Db Dc Dd 1 2 3 4 5 6
and crm114 fetches token '1'. The 'markovian' behaviour here is to pick two tokens, one current and one 'historic', from the pipeline, and combine them into a feature hash:
1 + Dd -> one feature hash
1 + Dc -> one feature hash
1 + Db -> one feature hash
1 + Da -> one feature hash
Given this behaviour, the sample input will produce a slew of feature hashes, each one represented here by its tokens, between parentheses:
(1+Da),(1+Db),(1+Dc),(1+Dd),
(2+1),(2+Da),(2+Db),(2+Dc),
(3+2),(3+1),(3+Da),(3+Db),
(4+3),(4+2),(4+1),(4+Da),
(5+4),(5+3),(5+2),(5+1),
(6+5),(6+4),(6+3),(6+2)
Side note / implementation consideration:
The current codebase has a 'flaw', or call it a 'feature', which prevents crm114 from continuing on past the end, so as to include the entire 'end edge' as it did the 'start edge' using those deadbeefs:

(De+6),(De+5),(De+4),(De+3),
(Df+De)[*],(Df+6),(Df+5),(Df+4),
(Dg+Df)[*],(Dg+De)[*],(Dg+6),(Dg+5),
(Dh+Dg)[*],(Dh+Df)[*],(Dh+De)[*],(Dh+6)
The ones marked with [*] should be removed from the hash stream, as they would occur for each and every input (they are independent of the input). The others in there carry only a single actual input token, like the Dx+n hashes at the start of the input stream, so their usefulness is disputable when all the other feature hashes combine 2 input tokens (except those deadbeef-ed ones at the start; more about those later on).
VT (Vector Tokenization) can mix any number of input tokens
Nevertheless, VT allows for mixing more than two tokens into a hash, e.g. (5+4+1), and even more complex Markov chains (=~ tokens strung together), in which case those extra few hashes at the end may be useful after all, especially when you are processing short messages, i.e. inputs with few tokens.
I rather think of this as a 'feature' instead of a 'flaw', despite the counterpoint from the start, where deadbeef is mixed in with the actual input token stream: as you can see, handling the 'end game' here requires some special edge action anyway (you must remove all those useless (Dx+Dy)[*] hashes).
Start and End [Of Input] conditions and what to do about them
The start hashes arguably contain only half of the input entropy their brethren in the middle carry: (Dx+n) vs. (n+m). Put another way: the hashes at the start are produced by shorter chain lengths, and their worth, such as it is, is disputable. After all, those chain lengths were introduced to improve selectivity in the classifier, and right at the start we introduce [a bit of] pollution, or so one might argue.
It may not be harmful for [large] email messages, but the effect, at a glance, may be more significant for short messages, e.g. when you try to filter SMS (mobile text) message traffic, or other low-token-count inputs.
The End Of Input dilemma: to ignore or to ..., that's the question
The second bit you probably already noticed as well: when you look at an input token n, it will occur as part of the mix in multiple resulting feature hashes. At least when token n has been sitting at the start or in the middle of the input message: token 1 occurs in 4 different feature hashes in our example.
Yet, and this may be considered a problem for short message handling, the input tokens near the end of the message (e.g. token 6 in our example) occur in only one (1) feature hash result. Unless we were to add those extra 'past the end edge' (n+Dx) hashes to the output, which would remove this objection but would introduce more of the previous one we raised here (invariant deadbeef mixed in with reduced input content).
But there is a way out: remember that our maximum chain length is limited (usually, for OSB and friends, it's 5; 'universally', it's UNIFIED_WINDOW_LEN, which is currently configured to 32). Now, since all we had trouble with and doubts about were the start and end sections, what if we 'wrap around' the input: what if we simply make the input a circle instead of a line?
Treating the input token stream as a cycle
How to do that? Simple. By:
- remembering the first 'pipeline depth minus 1' number of tokens,
- starting the hashing at index 'pipeline depth minus 1' instead of at the front @ index 0 (token 1),
- padding the end of the input with those 'depth minus 1' tokens from the start.
So our example input would now look like this, in effect:

1 2 3 4 5 6 1 2 3 4
As the code now 'starts' to hash at token 5 instead of 1, we always have the same chain length(s) represented by the feature hashes produced, anywhere along the input stream, and each chain combo is applied to all tokens, including the ones at the start and end, because we continue chain-hashing until we reach that last '4'.
But there's a little hitch there, related to changed order of input tokens, right?
The little hitch in this solution is that you'll get boundary artifacts, due to the end being seemingly 'glued' to a restart of the input: the hashes related to the start of the message will partly exist at the front (the ones which start at 5, e.g. (5+4),(5+3),(5+2),(5+1)) and partly at the end, where they mix with the last few end tokens.
However, I think the benefit of this approach over deadbeef-padding is that you've always got a fixed, fully 'input aware' chained input of tokens feeding your [VT] hasher. After all, the last hashes will be:

..., (3+2),(3+1),(3+6)!, (3+5)!, (4+3),(4+2),(4+1),(4+6)!

which, to my eye at least, is a better mix than the (Dx+n) deadbeef-mixes, which contain far fewer input bits in their resulting hashes.
Sure, it's probably nothing worth doing for large token count inputs, but I expect this to help short inputs classification. This idea has not been tested (yet) - the first public GerH build to offer this will be the BlameBarack release.
(End of Side note / implementation consideration)
And for those who wonder what this 'chains of tokens mixed into a feature' does for you: it basically means you're executing a classification based on 'sentences', rather than 'single words', as classification is feature hash based.
Assume a chain length of 2, feature hashes based only on (1+2), and the following snippet from a message you wish to classify:
... blue pill ...
Now you may recall from far above that pure, unadulterated Bayes was single-word based, so that's two tokens, 'blue' and 'pill', and thus two features for ya: (blue) and (pill).
Which we all know is a sure sign of someone trying to peddle something you don't need. Hence, train as spam: mark up +1 for (blue) and +1 for (pill). Training done.
Next, an important email from a friend comes along, mentioning a completely different 'pill':
... pill ...
Oops! His/her message will be rated b-b-b-bad if the filter gets its way, because there's this Bad Word: (pill), in there. Maybe [s]he even mentioned it more than once, so that's one message falsely accused of being spam. Likewise for the house redecorator who suggested 'blue' for the new baby room in her email: off to the spam bin with you!
So you moan, then train both good messages, so it's +1 for (blue) as being good, same for (pill), and then disaster strikes as the next load of 'blue pill' spam is, at best, marked as 'I dunno': the verdict is +2 for the bad boys and +1+1=+2 for the good boys. Which is it going to be? Quite a predicament.
Enter 'markovian' (and all the other crm114 classifiers): they take chains of words per feature [hash], and given our little (1+2) feature hash scheme, the above now goes like this:
... blue pill ... --> (blue+pill) --> train as +1 for bad.
Now the good 'pill' and 'blue' messages:
... pill ... --> (pill) --> NO HIT! --> train --> +1 for (pill) as good.
... blue ... --> (blue) --> NO HIT! --> train --> +1 for (blue) as good.
Now the second spam, which got us some serious agony before:
... blue pill ... --> (blue+pill) = +1 bad, NO HITS for good, so it's definitely 'bad' --> kill spam.
Excellent work! By using token chains as input to each single feature hash, crm114 now recognizes 'phrases', i.e. blurbs of text, instead of single words. Which gives us the ability to find out what's in there much more accurately.
(Of course, there's a catch: see our hash sequences further above. Instead of just producing (1),(2),(3),(4),(5),(6), the bugger now spits out 4 times as many hashes, so this markovian 'chaining' business increases the number of items to process and consequently requires a matching increase in storage space and processing, as a rule of thumb. Alas, storage is rather cheap, and processing power too, if you put that on a balance sheet with your earlier agony on the other side.)
So far, chain lengths and token mixing into feature hashes. Which is the prelude to understanding the 'extra' in Markovian.
The crm114 markovian classifier does not simply add the weights for each feature hit from the CSS databases, but applies a progressive scale to those weights:
Hits for a hash based on (1) will count as +1, for example, yet a hash based on (1+2) will count as +3 (or another higher factor). This goes on for the other mixed items: (1+3) will weigh in at, say, +7, etc. The idea here is that those 'longer distance' mixes are increasingly rare in the input, so hits for them should be rather more important than hits on simple, i.e. 'short chain distance', hashes. Depending on the internal settings, those extra factors can overpower the 'simple' hits, or not.
To put this in perspective, say you've got the input snippet:
... renew the mortgage on your house ...
where it makes sense that the feature (your+house), i.e. (2+1), is far less conspicuous than (mortgage+house), i.e. the (4+1) hash. (your house) may be 'good', but a message where 'mortgage' and 'house' occur within the same chain window, yet this far apart, is very probably spam, so that (4+1) hash hit should win over the (2+1) hit. And that's precisely what the extra feature_weight multipliers in crm114's Markovian classifier try to achieve.
That's nice, but a little trouble for the current VT implementation, because the VT code abstracts out this whole token-to-feature-hash business, so the classifiers no longer have an inkling which hash is based on what combo; it's just a series of hashes coming their way. Which is fine for most other classifiers, but not for Markovian. Sure, given the assumption that you're only doing defaults, no custom VT matrix or other fanciness, the core can 'guesstimate' which feature was based on what input combination by, in a way, duplicating the VT effort. But that's a very bad (dangerous) kludge!
There's a better way, but that requires an interface change: let VT deliver those weights as well. Done properly, i.e. à la the customizable <vector:> matrix spec in there, one can then also spec custom weight scales in any crm114 script at runtime, which saves the R&D folks several #define tweaks and extra recompiles, as you would need today with crm114 (see the crm_markovian.c code for this: various multiplier scales have been tested, but it takes a recompile to switch from one to the other). GerH-BlameBarack comes with an enhanced VT engine which will accept custom multiplier scales for markovian classifiers (and, if you like, for other classifiers as well, but that's reserved for a future release) and assumes a default scale per classifier, so it will mimic current Markovian as closely as possible by default, though now with the Bayesian classifier code converted to full VT ability.
It allows you to set up / configure how those tokens are to be mixed into feature hashes. Take our earlier example; that would be equivalent to this VT 'matrix':
1 3 0 0 0
1 0 3 0 0
1 0 0 3 0
1 0 0 0 3
which just tells VT to mix
(1+2) (1+3) (1+4) (1+5)
in a position-dependent manner: the final hash is based on 1 × the first subhash + 3 × the second subhash. This is not weighting the subhashes, mind you, just mixing them in a position-dependent manner using a universal hash paradigm: 1 × (the) + 3 × (house) gives a different value than 3 × (the) + 1 × (house), you see. Which makes the resulting feature hash dependent on the order in which the (the) and (house) input tokens appear.
See Knuth, The Art of Computer Programming, for more info regarding universal hash schemes; for the increasingly curious there are also Sedgewick and several others who have published papers showing why universal hashes are sometimes good and sometimes not. (That's why GerH builds offer alternative hash schemes for the subhashes (token hashes) themselves; the VT still performs a universal hash, but that's not considered harmful as long as the subhashes have excellent bit dispersion (avalanche) and the numbers in the VT matrixes fit the criteria given by Knuth et al. When you wish to play, start with using prime numbers in there.)
As VT knows which 'row' of the matrix produced a feature hash H, we can easily have VT look up its corresponding 'weight factor', and include that in the info sent back to the Markovian classifier. Ta-da!:
int crm_vector_tokenize_selector (
    ......
    crmhash_t *features,    // where to store the output features
    int featureslen,        // how many output features we can store
    int *feature_weights,   // feature weight factor per feature
    int *order_no,          // order_no (starting at 0) per feature
    int *features_out       // how many feature-slots did I fill for you?
);
order_no there stores the order of the feature hash. Just another fancy word for the row number of the VT matrix used to produce all those feature hashes. So order 0 is the hash produced by the first row of the VT matrix.
And it is implicitly assumed that 'more rare' / more important hash mix rows (~ longer Markov chains) appear in higher order rows in the VT matrix. This is not a sure thing, but rather an undocumented convention of sorts. Given our VT matrix sample above, that implies we're assuming row
1 0 0 0 3 to contribute more significant information to the classification process than row
1 3 0 0 0.
For the truly obsessed, there's also Arne's optimization (ARNE_OPTIMIZATION), which can be driven from the VT by also reporting the row number (and GerH-BlameBarack does that too). However, Arne's optimization only works for matrices where the first VT matrix row reads:
n 0 0 0 0 0 ....
where n != 0, and when you inspect the current 'vanilla' code, you will find that ARNE_OPTIMIZATION is active, and this condition is met, as the current intrinsic VT matrix for the active SBPH option actually is this one:
1 0 0 0 0
1 3 0 0 0
1 0 5 0 0
1 3 5 0 0
1 0 0 11 0
1 3 0 11 0
1 0 5 11 0
1 3 5 11 0
1 0 0 0 23
1 3 0 0 23
1 0 5 0 23
1 3 5 0 23
1 0 0 11 23
1 3 0 11 23
1 0 5 11 23
1 3 5 11 23
Well, it says that when you do not get a hit for your zero-order hash, i.e. the hash produced by the first 'VT matrix row', then all subsequent hashes, i.e. the ones from the higher rows, won't have a chance to hit either. This is true when the subsequent hashes are all pure extensions, chain-wise, of the initial row's hash. Read as: all other rows include all of the initial row's non-zero entries. This is correct for the current vanilla implementation; it's just undocumented.
So, for nitpickers, the earlier statement about needing
n 0 0 0 0 0 ....
for this is not entirely accurate: all it takes is that all non-zero values in the initial matrix row are copied verbatim to all other VT matrix rows, and Arne's optimization can then be used for any Bayesian-based classifier (Markovian, OSB, OSBF, Winnow). Unless I've read the rabbit entrails incorrectly regarding that last bit. :-)
(c)Copyright 2001-2009, Gerrit E.G. Hobbelt (Ger Hobbelt a.k.a. [i_a] ) - Hebbut.Net