Why GerH builds have an 'alt.markovian' and other 'alt.*' classifiers: those are classifiers using the latest VT technology and the classifier optimizations described here. What rejoicing's got to do with the humble WINNOW classifier. Where the crm114 script language gains a concept (variable scope) that helps you when you cross that KLOC boundary where you can't keep it all in your head anymore (such as with mailreaver et al).
Ger Hobbelt
Note that this space shows the state of the art as per 19 February 2009. Any source code in this article shows the state as per that date (or earlier); its value is purely educational. Fetch a GerH build if you'd like to have this stuff in a usable form.
Update: as per 2009/02/19, there's a complete release available, including prebuilt binaries for Windows users. As per that date, the 'alt.*' classifiers are no longer 'experimental', a state in which they had existed within the crm114 source and binary distributions since 2008.
2009/02/09 experimental source-only release
This wasn't so much a release as rather a preview of one coming up: as such, it's the first bit of GerH code published after it's officially become a branch.
Mind you, this was for playing around, having a sniff, but it does not pretend to be production worthy in any way. The fact that the old bugger failed its own extended 'make check' test set should say enough: Mr. Hobbelt didn't do his usual job there and then.
Nevertheless, I think it was still good enough to get it out there. First of all, it's got some new stuff that's not in the vanilla mainline. Then it's also got some fixes for bugs of my own and several issues which exist in mainline till this day.
And, no, there's no libcrm114 in here, yet. Still, we've got the new classifier derivatives
alt.markovian, alt.osb, alt.osbf and alt.winnow, which come with full VT support in a big way. It took some modifications to the VT engine to make this work; the < clip: x y z > and < weight-model: name > attributes are now in there too [per 2009/02/19], as well as the < vector: w h d m0 m1 m2 ... > and < weight: w h m00 m01 m10 m11 ... > matrices, for the first time ever.
What follows further down is rather technical at times, so here's the lowdown, management-summary style:
Now offers proven production classifier technology (Bayesian/Markovian) combined with the latest in feature construction (VT / Vector Tokenization), which allows us to tune these classifiers for maximum performance in ways that were previously impossible outside the R&D lab.
Comes with a big help for crm script writers who are developing larger crm script applications / scripts: a tiny feature called variable scoping, a time-tested technology found in all other major (and minor) programming languages of the past several decades. (Wulf and Shaw might be a little happier for you now...)
mailfilter, mailreaver and mailtrainer now all use the maillib.crm library code. This means mailfilter users now share all the benefits with the mailreaver folks - and vice versa. Same command line. All the extras.
Plus, last but not least, those mail scripts now finally add support for the winnow classifier, which performs significantly better than the default OSB classifier in some email environments.
(or how VT now also supports cross-cut Markovian/Bayesian classifiers)
Here's the original VT description by Bill Yerazunis. That's what I call VT 2007.
Now this is VT 2009. It is VT 2007, with some additional twists and an engine that has been rewritten from the ground up:
Weight/order models help bump up the score for important, context-sensitive feature hits.
Markovian/Bayesian classifiers use two hashes to represent a single feature (a capability which was already anticipated and built into VT 2007), but Markovian and the others also weight the feature hits depending on the order of the feature: features which include more context (in a sparse bigram sense: they mix tokens which are farther away from each other) are more important than features which contain less context. This is done by providing each feature with an order number (which simply is its row number in the VT matrix, where row numbering starts from the top down), which corresponds to a certain weight.
Various models for this have been published in literature (Super-Markov, Breyer-Chhabra-Siefkes, ...); this release [2009/02/19] offers a choice between these to help you with your own research, instead of only offering the default CRM114 weight model. This ability is entirely absent in the vanilla crm114 releases.
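To make the idea of an order-dependent weight a bit more tangible, here is a minimal Python sketch. It only shows the mechanism: a feature's row number in the VT matrix selects a weight, and the weights of all hits are summed. The model names `flat` and `exponential`, the `base` parameter and the assumption that lower rows carry more context are all made up for this illustration; they are not the Super-Markov, Breyer-Chhabra-Siefkes or default CRM114 weight tables.

```python
def order_weight(row, model="exponential", base=2):
    """Return the weight of a feature produced by VT matrix row `row` (0 = top row).

    Both models are illustrative stand-ins, not crm114's real weight tables.
    """
    if row < 0:
        raise ValueError("row must be >= 0")
    if model == "flat":
        return 1                  # every order counts the same
    if model == "exponential":
        return base ** row        # assumed: later rows = more context = more weight
    raise ValueError("unknown weight model: %r" % (model,))


def weighted_hits(hit_rows, model="exponential", base=2):
    """Sum the weights of all feature hits, given the row number of each hit."""
    return sum(order_weight(row, model, base) for row in hit_rows)


if __name__ == "__main__":
    hits = [0, 0, 2, 4]           # row numbers of the features found in one class
    print(weighted_hits(hits, "flat"))
    print(weighted_hits(hits, "exponential", base=2))
```

Swapping the model changes how strongly a single high-order hit can outvote several low-order hits, which is exactly the knob a selectable weight model gives you.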
VT 2009 includes code to enhance all classifiers with the <unique> attribute in a unified manner, using the Hyperspace approach.
The old Markovian/Bayesian classifiers handle this a little differently, implementation-wise: instead of sorting and unique-filtering the feature hash stream, these classifiers kept a second, large, lookup table around to check if a feature had already been discovered in the database under test. However, it is felt that the Hyperspace approach both saves memory and performs similarly to the older vanilla Bayesian/Markovian approach.
By moving <unique> handling into the VT engine, this feature now has also become readily available for other, more sophisticated classifiers, such as SVM, SKS and Neural Net at no cost.
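The two ways of handling unique features can be sketched in a few lines of Python. This is a toy model of the idea only, not the crm114 C code: `unique_sorted` mimics the sort-and-filter (Hyperspace style) approach, `unique_lookup` mimics keeping a separate lookup table around. Both function names are invented for this example.

```python
def unique_sorted(hashes):
    """Sort the feature hashes, then drop adjacent duplicates (no extra table needed)."""
    result = []
    for h in sorted(hashes):
        if not result or result[-1] != h:
            result.append(h)
    return result


def unique_lookup(hashes):
    """Keep a second lookup table of hashes seen so far; stream order is preserved."""
    seen = set()
    result = []
    for h in hashes:
        if h not in seen:
            seen.add(h)
            result.append(h)
    return result


if __name__ == "__main__":
    stream = [5, 3, 5, 1, 3]
    print(unique_sorted(stream))
    print(unique_lookup(stream))
```

Both yield the same set of features; the sorted variant trades the extra table for a sort of the hash stream.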
See above. From now on, you can write things like
classify <osb> // /vector: 7 3 1 1 3 0 0 0 0 0 1 7 11 13 19 0 0 1 17 0 0 43 57 61/ (class1.css | class2.css) [:input:]
a crm114 classifier feature which was, until now, only available with Hyperspace and the experimental classifiers, such as SKS, SVM and Neural Net.
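For readers who want to see what such a vector specification amounts to, here is a hedged Python sketch. It assumes that `w h d` stand for width, height and depth (the example above has 7 x 3 x 1 = 21 coefficients, which fits that reading), that each matrix row is multiplied element-wise with a sliding window of token hashes, and that the sum is truncated to 32 bits. The hash (`zlib.crc32`) is merely a stand-in; crm114 uses its own hash function, and the real VT engine does more than this.

```python
import zlib

MASK32 = 0xFFFFFFFF


def parse_vector(spec):
    """Split 'w h d m0 m1 ...' into (w, h, d, rows); each row holds w coefficients."""
    nums = [int(tok) for tok in spec.split()]
    w, h, d = nums[:3]
    coeffs = nums[3:]
    if len(coeffs) != w * h * d:
        raise ValueError("expected %d coefficients, got %d" % (w * h * d, len(coeffs)))
    rows = [coeffs[i * w:(i + 1) * w] for i in range(h * d)]
    return w, h, d, rows


def features(tokens, rows):
    """Emit (row number, feature hash) pairs: one per matrix row per token position."""
    w = len(rows[0])
    hashes = [zlib.crc32(t.encode("utf-8")) & MASK32 for t in tokens]
    out = []
    for i in range(len(hashes)):
        # window[0] is the current token, window[k] the token k positions back
        window = [hashes[i - k] if i - k >= 0 else 0 for k in range(w)]
        for r, row in enumerate(rows):
            out.append((r, sum(c * x for c, x in zip(row, window)) & MASK32))
    return out


if __name__ == "__main__":
    spec = "7 3 1 1 3 0 0 0 0 0 1 7 11 13 19 0 0 1 17 0 0 43 57 61"
    w, h, d, rows = parse_vector(spec)
    print(rows)
    print(len(features("the quick brown fox".split(), rows)))
```

A zero coefficient simply drops the token at that window position from the feature, which is how a row expresses 'skip' positions.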
Yes, this is something for all of you who write more than a few lines of crm114 script: finally, the time when all variables you created anywhere were globally accessible is over. That was a real horror when you were include-ing script files and/or calling large functions, such as happens in the mailreaver / mailfilter / mailtrainer scripts.
When you create a variable, e.g. when writing a line like this:
isolate (:my_var:) //
the scope of the variable is restricted to the call depth level (which can be retrieved by referencing the built-in :_cd: variable, by the way).
When you write code using routines in crm114 script (that is: start a bit of code with a :label:, end it with a return statement and have it call-ed from other parts of your script), any new variables isolate-d or otherwise created after the call will be limited to the scope of that call.
Thus, to return results from these functions, you either return the result or ensure you already have created the variables before the call.
A few examples will help in understanding this, I'm sure.
Formerly, you could write stuff like this:
isolate (:a:) /x/
call /:b:/
output /:*:c:\n/          # output "hello!\n"
exit 0
:b:
isolate (:c:) /hello!/
return
which would assign 'hello!' to the global variable :c: (as all variables were created with global scope) and permit the caller to access :c: once the :b: routine returned.
Of course, for single return values, there already existed the cleaner solution:
isolate (:a:) /x/
call /:b:/ (:c:)
output /:*:c:\n/          # output "hello!\n"
exit 0
:b:
return /hello!/
but what if you were constructing a routine which produced several results all at once?
Using regexes to split the single return-ed value into those separate entities again feels - and is - way too much overhead for reasons of furthering the good cause of good (machine) language design.
Also, thanks to the global, single scope for all variables, you could inadvertently (or advertently) use and abuse its side effects by writing code like this:
isolate (:a:) /x/
call /:b:/ (:c:)
output /:*:c:? :*:d:!\n/  # output "hello? hello!\n"
exit 0
:b:
isolate (:d:) /hello/
return /:*:d:/
Note the use of :d: in the caller code (output) at the top. Doesn't sound too bad? Well, have a look at this:
isolate (:a:) /x/
call /:b:/ (:c:)
isolate (:d:)              # create var :d: -- or so you thought!
output /:*:c:? (:*:d:)\n/  # output "hello? (hello)\n"
                           # ^ without :b: side-effects, expect output:
                           #   "hello? ()\n"
exit 0
:b:
isolate (:d:) /hello/
return /:*:d:/
Note the isolate-ion of :d: in the caller code: it is an often used 'shorthand' for
isolate (:d:) //
which is even used as such by the master himself (Bill Yerazunis). This 'shorthand' may look cool to lazy people, but it has a little snag: in actual reality, it is a shorthand for:
isolate (:d:) // <default>
which says: create variable :d:, iff it doesn't exist already, and assign it an empty value, again, iff it didn't exist before. Quite a different thing, eh?
The resulting error, not a code error, but a programmer assumption error, in the example script above is obvious.
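The difference between the two forms can be mimicked with a plain Python dictionary standing in for the crm114 variable table. This is only a toy model of the semantics described above; the function names `isolate_default` and `isolate_assign` are invented for the illustration.

```python
def isolate_default(variables, name, value=""):
    """Model of 'isolate (:name:) /value/ <default>': assign only if the variable is new."""
    variables.setdefault(name, value)


def isolate_assign(variables, name, value=""):
    """Model of 'isolate (:name:) /value/': create the variable or overwrite its value."""
    variables[name] = value


if __name__ == "__main__":
    variables = {}
    isolate_assign(variables, "d", "hello")   # what routine :b: did behind your back
    isolate_default(variables, "d")           # the caller's bare 'isolate (:d:)'
    print(repr(variables["d"]))               # still holds the value set by :b:
    isolate_assign(variables, "d", "")        # the explicit 'isolate (:d:) //'
    print(repr(variables["d"]))               # now it really is empty
```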
Now consider scripts at sizes similar to, say, mailreaver. Which includes a huge chunk of code, called maillib.crm. Which already has/had several of such lucky bits in its code: fortunately those were harmless.
Ok, fine, so variable scoping kills the use of side effects like that. But what if you were constructing a routine which produced several results all at once?
Never fear, though: as the scoping applies to variables as they are created, it implicitly means you can access all variables which have been defined in outer scope, as well as all locally created variables. Thus, the previous example would now be coded like this:
isolate (:a:) /x/
isolate (:d:) # create var :d: -- still, remember this is 'isolate (:d:) // <default>'!
# it means one can invoke this script as 'crm114 this.crm --d=yo' and
# thus have :d: 'preset' to "yo" by crm114.
call /:b:/ (:c:)
output /:*:c:? (:*:d:)\n/ # output "hello? (hello)\n" -- as expected.
exit 0
:b:
isolate (:d:) /hello/ # overwrites outer scope :d: value
return /:*:d:/
Making sure the variables manipulated by function :b: are already 'declared' (created) by the caller (through those isolate statements or otherwise) makes reading the code a bit easier as well: no more variables which spontaneously pop out of thin air - a habit some of them seem to have when you're crossing that LOC boundary where everything in there doesn't fit into your head all at the same time any more...
Of course, the previous examples hinted as much, but variables can be accessed and altered within their scope and any 'sub scope', i.e. call-ed subroutine:
isolate (:a:) /x/
call /:b:/
# cannot access :c:
# cannot access :e:
exit 0
:b: # can edit :a:
isolate (:c:) /y/ # scope: call depth >= 1
call /:d:/
return # :c: is lost for ever
:d: # can edit :a:, can edit :c:
isolate (:e:) /z/ # scope: call depth >= 2
return # :e: is lost for ever
Note that, as before, a fault (i.e. trap handling) does not 'pop' call depth, i.e. scope level. Thus, any faults issued from a subroutine call, will always be trap-handled at that same call depth, no matter which trap code line catches the fault. This, incidentally, is one of the major reasons why return at call depth = 0 (i.e. the main code) acts as the equivalent of exit.
Of course, a trap handler can inspect the internal :_cd: variable to see if a return or exit is in order, callstack-wise, but that is an effort seldom found (or needed) in crm114 scripts.
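The intended scoping rule can be modelled in a few lines of Python. This is a toy model under stated assumptions, not the crm114 implementation: the class `ScopedVars` and its method names are invented, `depth` plays the role of :_cd:, and every variable simply remembers the call depth at which it was first created so it can be discarded when that depth is left.

```python
class ScopedVars:
    """Toy model: a variable lives from its creation until its call depth returns."""

    def __init__(self):
        self.values = {}       # variable name -> current value
        self.created_at = {}   # variable name -> call depth at creation
        self.depth = 0         # models the built-in :_cd: variable

    def isolate(self, name, value=""):
        """Create the variable at the current depth, or overwrite an existing one."""
        if name not in self.values:
            self.created_at[name] = self.depth
        self.values[name] = value

    def call(self):
        """Enter a subroutine: one scope level deeper."""
        self.depth += 1

    def ret(self):
        """Leave a subroutine: drop every variable created at this depth."""
        if self.depth == 0:
            return             # in crm114 this acts like exit; the model just stops
        for name in [n for n, d in self.created_at.items() if d == self.depth]:
            del self.values[name]
            del self.created_at[name]
        self.depth -= 1


if __name__ == "__main__":
    v = ScopedVars()
    v.isolate("a", "x")        # depth 0
    v.call()                   # :b:
    v.isolate("c", "y")        # depth 1
    v.call()                   # :d:
    v.isolate("e", "z")        # depth 2
    v.isolate("a", "edited")   # outer variables stay editable
    v.ret()                    # :e: is gone
    v.ret()                    # :c: is gone
    print(v.values)
```

Outer-scope variables stay visible and editable in every sub scope, while anything created deeper down disappears once its routine returns.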
Finally, mailfilter and mailreaver have become one, while still each supporting their regular set of parameters. Check out
mailreaver --help
mailfilter --help
mailtrainer --help
for more info.
Suffice to say that all of the above are now completely into using the maillib.crm mail processing library for CRM114.
mailtrainer has seen some enhancements as well, which allow for further ability to tune the training of your email message stream.
Well, that one was a nice surprise. Since I haven't used mailreaver myself before (crm114-GerH is used in different environments), nor used the WINNOW classifier for anything but the obligatory megatest and make check tests, I hadn't noticed it before, but now that I've been running additional tests on the classifiers, I noted a wicked thing that is entirely missing from the crm114 documentation:
That is: you have to train WINNOW addressing each class, training regularly into the target CSS class and <refute> training into all the other CSS classes.
And guess what? mailfilter.crm and mailreaver.crm (let alone mailtrainer.crm) were never able to do this, so when you tried to use WINNOW for your mail filtering, you'd quickly find out the 'default' OSB classifier would outperform WINNOW. Which isn't exactly what you'd find when you checked the published TREC whitepapers.
GerH mailtrainer (and thus mailfilter and mailreaver) now supports consistent double-sided training, which you should use with WINNOW.
Important: note that doublesided training deteriorates the performance of all the other Bayesian/Markovian classifiers, so make sure you set that flag properly in mailfilter.cf.
Note: I am considering building an 'auto-sense' into mailtrainer et al, which turns on that doublesided training automatically when you pick WINNOW or ALT.WINNOW as your classifier of choice for your email filtering. That would reduce the risk of incorrectly setting up that (and other) classifiers.
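As a sketch of what double-sided training boils down to, the Python fragment below lists the training steps for a single message: a regular learn into the target class, plus a refute-style learn into every other class. It is an illustration only, not the mailtrainer code; the function names, the step labels and the `needs_double_sided` helper (a guess at what such an 'auto-sense' might look like) are hypothetical.

```python
def needs_double_sided(classifier):
    """Hypothetical auto-sense: only the WINNOW family wants double-sided training."""
    return classifier.lower() in ("winnow", "alt.winnow")


def training_plan(target, classes, double_sided):
    """Return the (css file, mode) steps for training one message of class `target`."""
    if target not in classes:
        raise ValueError("unknown target class: %r" % (target,))
    plan = [(target, "learn")]
    if double_sided:
        # refute-train the same message into all the other classes
        plan.extend((c, "refute") for c in classes if c != target)
    return plan


if __name__ == "__main__":
    classes = ["spam.css", "nonspam.css"]
    for classifier in ("osb", "winnow"):
        print(classifier, training_plan("spam.css", classes, needs_double_sided(classifier)))
```

With more than two classes, the refute step is repeated for every class other than the target.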
Now that my tests turned this up, I checked vanilla and had this A-ha! Erlebnis (plus Engrish translation) you often get when reading classic UNIX man pages: by the time you've found out yourself, the writing clearly states you should do it precisely like that. Same with crm114, when you check Bill's megatest.sh. My rejoicing in this rediscovery knows no bounds.
Yes. I've fallen so deep vanilla's megatest.sh is now... documentation.
tenfold validation revisited
The old tenfold validation script has got a new and improved brother (tenfold_validation_ex) which allows for different training approaches, among other things.
more tests, better quality
'make check' now includes several more tests to ensure crm114-GerH quality of engineering. That's also part of the reason why the previous 2009-02-12 'experimental quicky release' failed that test set: it was not up to par, yet.
By the way, I still don't understand why vanilla crm114 does not distinguish between test inputs and what is apparently documentation: any change in the latter has an immediate and very significant adverse effect on doublechecking / validating crm114 in terms of classifier result quality and reproducibility (and that's not only between different platforms, but also between different vanilla crm114 releases). I had that discussion back in '07 and, given the unsatisfying answers, GerH builds continue to use separate input documents for the tests.
The only reason why I change those *.input documents with my releases is to mimic BillY's vanilla crm114 classifier tests. This is a known malfeature due to my desire to 'stay in sync': by acting thus I produce the same cross-version validation disaster you get with Bill's. I'm sorry for that.
...
variable scoping: variables in local scope are not yet discarded as they should be:
isolate (:a:) /x/
call /:b:/
call /:c:/
exit 0
:b:
isolate (:d:) /y/            # scope: call depth >= 1
return                       # :d: is lost? Or is it...?
:c:
isolate (:d:) /z/ <default>  # <-- will discover :d: is still around @ call depth = 1!
output /d=:*:d:\n/
return
This can be easily resolved by tracking all newly created variables in a list attached to the relevant (generally that is: the current) CSL. A list is good enough as the VHT lookup code will already have determined if the variable exists or not, so adding this tracking bit to the appropriate bit of code in there, ensures only really freshly created variables are added to the list.
Removing those variables means we'll need to actually destroy them when return arrives, yet destroying variables is a thing as yet unknown to the crm114 script realm, so there may be a hitch or two in the code before we've got that one covered.
Heck, the usual.
Exactly like with vanilla crm114, CLUMP is still utter coredump crud; Neural Net too has a fondness for coredumping without a bungee when put to the test ('fail to resolve' is fine with me, but dumping core...), and the 'correlate' classifier popped up on the horizon of WTF when finally subjected to some serious tests, thanks to a complete lockup during its training stress test... Nothing which excites me all that much anymore, either way. Sigh. It is as it is. We'll get there when we get there. Though I hope Bill is faster in getting there.
(c) Copyright 2001-2009, Gerrit E.G. (Ger) Hobbelt - Hebbut.Net