CRM114 for Win32/UNIX - BlameBarack-GerH for connoisseurs

New stuff being served...

Why GerH builds have an 'alt.markovian' and other 'alt.*' classifiers: those are classifiers using the latest VT technology and the classifier optimizations described here. What rejoicing's got to do with the humble WINNOW classifier. Where the crm114 script language gains a concept (variable scope) that helps you when you cross that KLOC boundary where you can't keep it all in your head anymore (such as with mailreaver et al).

Ger Hobbelt

Note that this space shows the state of the art as per 19 February 2009. Any source code in this article shows the state as per that date (or earlier); its value is purely educational. Fetch a GerH build if you would like to have this stuff in a usable form.

The 2009/02/19 GerH crm114 release

Update: as per 2009/02/19, there's a complete release available, including prebuilt binaries for Windows users. As per that date, the 'alt.*' classifiers are no longer 'experimental', a state in which they had existed within the crm114 source and binary distributions since 2008.

2009/02/09 experimental source-only release

This wasn't so much a release as a preview of one coming up: as such, it's the first bit of GerH code published after it officially became a branch.

Mind you, this was for playing around, having a sniff, but it did not pretend to be production worthy in any way. The fact that the old bugger failed its own extended 'make check' test set should say enough: Mr. Hobbelt didn't do his usual job there and then.

Nevertheless, I think it was still good enough to get it out there. First of all, it's got some new stuff that's not in the vanilla mainline. Then it's also got some fixes for bugs of my own and several issues which exist in mainline till this day.

And, no, there's no libcrm114 in here, yet. Instead, we've got the new classifier derivatives alt.markovian, alt.osb, alt.osbf and alt.winnow, which come with full VT support in a big way. It took some modifications to the VT engine to make this work; the < clip: x y z > and < weight-model: name > attributes are now in there too [per 2009/02/19], as well as the < vector: w h d m0 m1 m2 ... > and < weight: w h m00 m01 m10 m11 ... > matrices, for the first time ever.

Lowdown for the user

What follows further down is rather technical at times, so here's the lowdown, management-summary style:

VT 2009: Vector Tokenization

(or how VT now also supports cross-cut Markovian/Bayesian classifiers)

Here's the original VT description by Bill Yerazunis. That's what I call VT 2007.

Now this is VT 2009. It is VT 2007, with some additional twists and an engine that has been rewritten from the ground up:

Production classifier alternatives with full VT support

See above. From now on, you can write things like

classify <osb> // /vector: 7 3 1   1 3 0 0 0 0 0   1 7 11 13 19 0 0   1 17 0 0 43 57 61/ (class1.css | class2.css) [:input:]

a crm114 classifier feature which was, until now, only available with Hyperspace and the experimental classifiers, such as SKS, SVM and Neural Net.
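To give a feel for what such a vector: spec asks the tokenizer to do, here is a minimal sketch in Python (not crm114 script, and not the GerH engine code). It assumes the spec reads as width w, height h, depth d, followed by w*h*d coefficients, which matches the 7 x 3 x 1 = 21 numbers in the line above. It also assumes each feature is the dot product of one coefficient row with a sliding pipeline of the last w token hashes. The CRC32 token hash, the whitespace token split and the name vector_features are stand-ins chosen for illustration only.

```python
import re
import zlib


def vector_features(text, spec):
    """Turn text into feature hashes from a [w, h, d, m0, m1, ...] spec.

    Assumption: w = pipeline width, h = rows per matrix, d = number of
    stacked matrices, followed by exactly w*h*d integer coefficients.
    """
    w, h, d = spec[:3]
    coeffs = spec[3:]
    if len(coeffs) != w * h * d:
        raise ValueError(f"expected {w * h * d} coefficients, got {len(coeffs)}")
    rows = [coeffs[i * w:(i + 1) * w] for i in range(h * d)]

    # Stand-in token hash (assumption): CRC32, not crm114's own hash function.
    tokens = [zlib.crc32(t.encode("utf-8")) for t in re.findall(r"\S+", text)]

    pipeline = [0] * w              # newest token hash sits at index 0
    features = []
    for tok in tokens:
        pipeline = [tok] + pipeline[:-1]
        for row in rows:
            # One feature per row: weighted sum of the pipeline, kept to 32 bits.
            value = sum(c * p for c, p in zip(row, pipeline)) & 0xFFFFFFFF
            features.append(value)
    return features


SPEC = [7, 3, 1,
        1, 3, 0, 0, 0, 0, 0,
        1, 7, 11, 13, 19, 0, 0,
        1, 17, 0, 0, 43, 57, 61]

feats = vector_features("the quick brown fox", SPEC)
print(len(feats))   # 12: four tokens times three coefficient rows
```

A zero coefficient simply drops that pipeline slot from the feature, which is how a single matrix row can express 'this token plus the one five positions back' without any special casing.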

Finally, crm114 script has scoped variables!

Yes, this is something for all of you who write more than a few lines of crm114 script: finally, the time when all variables you created anywhere were globally accessible is over. That was a real horror when you were include-ing script files and/or call-ing large functions, such as is happening in the mailreaver / mailfilter / mailtrainer scripts.

Now, variables are scoped. How?

When you create a variable, e.g. when writing a line like this:

isolate (:my_var:) //

the scope of the variable is restricted to the call depth level (which can be retrieved by referencing the built-in :_cd: variable, by the way).

The impact

When you write code using routines in crm114 script (that is: start a bit of code with a :label:, end it with a return statement and have it call-ed from other parts of your script), any new variables isolate-d or otherwise created after the call will be limited to the scope of that call.

Thus, to return results from these functions, you either return the result or ensure you already have created the variables before the call.

A few examples to show what I mean

A few examples will help in understanding this, I'm sure.

Formerly, you could write stuff like this:

  isolate (:a:) /x/
  call /:b:/
  output /:*:c:\n/    # output "hello!\n"
  exit 0

:b:
  isolate (:c:) /hello!/
  return

which would assign 'hello!' to the global variable :c: (as all variables were created with global scope) and permit the caller to access :c: once the :b: routine returned.

Of course, for single return values, there already existed the cleaner solution:

  isolate (:a:) /x/
  call /:b:/ (:c:)
  output /:*:c:\n/    # output "hello!\n"
  exit 0

:b:
  return /hello!/

but what if you were constructing a routine which produced several results all at once?

Using regexes to split the single return-ed value into those separate entities again feels like, and is, way too much overhead to pay merely for the sake of furthering the good cause of good (machine) language design.

Also, thanks to the global, single scope for all variables, you could inadvertently (or advertently) use and abuse its side effects by writing code like this:

  isolate (:a:) /x/
  call /:b:/ (:c:)
  output /:*:c:? :*:d:!\n/    # output "hello? hello!\n"
  exit 0

:b:
  isolate (:d:) /hello/
  return /:*:d:/

Note the use of :d: in the caller code (output) at the top. Doesn't sound too bad? Well, have a look at this:

  isolate (:a:) /x/
  call /:b:/ (:c:)
  isolate (:d:)  # create var :d: -- or so you thought!
  output /:*:c:? (:*:d:)\n/    # output "hello? (hello)\n"
  # ^ without :b: side-effects, expect output:
  #      "hello? ()\n"
  exit 0

:b:
  isolate (:d:) /hello/
  return /:*:d:/

Note the isolate-ion of :d: in the caller code: it is an often used 'shorthand' for

isolate (:d:) //

which is even used as such by the master himself (Bill Yerazunis). This 'shorthand' may look cool to lazy people, but it has a little catch: in actual reality, it is a shorthand for:

isolate (:d:) // <default>

which says: create variable :d:, iff it doesn't exist already, and assign it an empty value, again, iff it didn't exist before. Quite a different thing, eh?

The resulting error, not a code error, but a programmer assumption error, in the example script above is obvious.
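The difference between the two forms of isolate can be modelled in a few lines of Python. This is only a sketch of the semantics described above, using a plain dict as the (formerly global) variable table; the function names isolate_assign and isolate_default are made up for this illustration.

```python
def isolate_assign(variables, name, value=""):
    """Model of 'isolate (:name:) /value/': always (re)assigns the value."""
    variables[name] = value


def isolate_default(variables, name):
    """Model of bare 'isolate (:name:)': create as empty only if missing."""
    variables.setdefault(name, "")


table = {}
isolate_assign(table, "d", "hello")   # the side effect left behind by :b:
isolate_default(table, "d")           # the caller's bare isolate: changes nothing
after_bare = table["d"]
isolate_assign(table, "d", "")        # the explicit 'isolate (:d:) //'
after_explicit = table["d"]
print(repr(after_bare), repr(after_explicit))   # 'hello' ''
```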

Now consider scripts at sizes similar to, say, mailreaver. Which includes a huge chunk of code, called maillib.crm. Which already has/had several of such lucky bits in its code: fortunately those were harmless.

No more wicked side effects

Ok, fine, so variable scoping kills the use of side effects like that. But what if you were constructing a routine which produced several results all at once?

Never fear, though: as the scoping applies to variables as they are created, it implicitly means you can access all variables which have been defined in an outer scope, as well as all locally created variables. Thus, the previous example would now be coded like this:

  isolate (:a:) /x/
  isolate (:d:)  # create var :d: -- still, remember this is 'isolate (:d:) // <default>'!
                 # it means one can invoke this script as 'crm114 this.crm --d=yo' and
                 # thus have :d: 'preset' to "yo" by crm114.
  call /:b:/ (:c:)
  output /:*:c:? (:*:d:)\n/    # output "hello? (hello)\n" -- as expected.
  exit 0

:b:
  isolate (:d:) /hello/        # overwrites outer scope :d: value
  return /:*:d:/

Making sure the variables manipulated by function :b: are already 'declared' (created) by the caller (through using those isolate statements or otherwise) makes reading the code a bit easier as well: no more variables which spontaneously pop out of thin air - a habit some of them seem to have when you're crossing that LOC boundary where everything in there doesn't fit into your head all at the same time any more...

Of course, the previous examples hinted as much, but variables can be accessed and altered within their scope and any 'sub scope', i.e. call-ed subroutine:

  isolate (:a:) /x/
  call /:b:/
                     # cannot access :c:
                     # cannot access :e:
  exit 0

:b:                  # can edit :a:
  isolate (:c:) /y/  # scope: call depth >= 1
  call /:d:/
  return             # :c: is lost for ever

:d:                  # can edit :a:, can edit :c:
  isolate (:e:) /z/  # scope: call depth >= 2
  return             # :e: is lost for ever
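The bookkeeping behind that example can be sketched in Python as a variable table that remembers the call depth at which each variable was created. This is a toy model written for this article's examples, not the GerH implementation: the class ScopedVars and its method names are invented here, and fault/trap handling is deliberately left out.

```python
class ScopedVars:
    """Toy model of call-depth variable scoping (not the GerH code)."""

    def __init__(self):
        self.depth = 0     # plays the role of :_cd:
        self.table = {}    # name -> [value, call depth at creation]

    def isolate(self, name, value=""):
        if name in self.table:
            self.table[name][0] = value              # already visible: overwrite
        else:
            self.table[name] = [value, self.depth]   # new: owned by this depth

    def visible(self, name):
        return name in self.table

    def get(self, name):
        return self.table[name][0]

    def call(self, routine):
        self.depth += 1
        result = routine(self)
        # 'return': everything created at this depth is lost for ever.
        self.table = {n: v for n, v in self.table.items() if v[1] < self.depth}
        self.depth -= 1
        return result


def routine_d(env):
    env.isolate("a", "edited in d")   # can edit :a:
    env.isolate("c", "edited in d")   # can edit :c:
    env.isolate("e", "z")             # scope: call depth >= 2


def routine_b(env):
    env.isolate("c", "y")             # scope: call depth >= 1
    env.call(routine_d)
    return env.get("c"), env.visible("e")


env = ScopedVars()
env.isolate("a", "x")
seen_in_b = env.call(routine_b)
print(seen_in_b)                                          # ('edited in d', False)
print(env.get("a"), env.visible("c"), env.visible("e"))   # edited in d False False
```

Storing the creation depth next to the value keeps lookups flat: a subroutine sees everything its callers created, and a return only has to discard the entries tagged with the depth being left.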

Note that, as before, a fault (i.e. trap handling) does not 'pop' call depth, i.e. scope level. Thus, any fault issued from a subroutine call will always be trap-handled at that same call depth, no matter which trap code line catches the fault. This, incidentally, is one of the major reasons why return at call depth = 0 (i.e. the main code) acts as the equivalent of exit.

Of course, a trap handler can inspect the internal :_cd: variable to see if a return or exit is in order, callstack-wise, but that is an effort seldom found (or needed) in crm114 scripts.

mailreaver / mailtrainer / mailfilter unified

Finally, mailfilter and mailreaver have become one, while still each supporting their regular set of parameters. Check out

mailreaver --help
mailfilter --help
mailtrainer --help

for more info.

Suffice to say that all of the above now rely entirely on the maillib.crm mail processing library for CRM114.

mailtrainer has seen some enhancements as well, which give you more ways to tune the training on your email message stream.

WINNOW classifier and mailreaver / mailtrainer / mailfilter

Well, that one was a nice surprise. Since I haven't used mailreaver myself before (crm114-GerH is used in different environments), nor used the WINNOW classifier for anything but the obligatory megatest and make check tests, I hadn't noticed it before, but now that I've been running additional tests on the classifiers, I noted a wicked thing that is entirely missing from the crm114 documentation:

WINNOW only performs well when trained 'double sided'

That is: you have to train WINNOW addressing each class, training regularly into the target CSS class and <refute> training into all the other CSS classes.

And guess what? mailfilter.crm and mailreaver.crm (let alone mailtrainer.crm) were never able to do this, so when you tried to use WINNOW for your mail filtering, you'd quickly find out that the 'default' OSB classifier would outperform WINNOW. Which isn't exactly what you'd find when you checked the published TREC whitepapers.

GerH mailtrainer (and thus mailfilter and mailreaver) now supports consistent double-sided training, which you should use with WINNOW.
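What 'double-sided' means can be sketched with a bare-bones Winnow-style learner in Python. This is an illustration of the training pattern only, not the crm114 WINNOW code: the promotion and demotion factors, the averaged-weight score and every function name below are assumptions picked for the example.

```python
from collections import defaultdict

PROMOTE = 1.23   # illustrative factor (assumption)
DEMOTE = 0.83    # illustrative factor (assumption)


def new_class():
    """One 'CSS file': feature -> weight, every weight starting at 1.0."""
    return defaultdict(lambda: 1.0)


def learn(weights, features):
    """Regular training: promote every feature seen in the document."""
    for f in set(features):
        weights[f] *= PROMOTE


def refute(weights, features):
    """Refute training: demote every feature seen in the document."""
    for f in set(features):
        weights[f] *= DEMOTE


def train_double_sided(classes, target, features):
    """Learn into the target class and refute into all the other classes."""
    for name, weights in classes.items():
        if name == target:
            learn(weights, features)
        else:
            refute(weights, features)


def score(weights, features):
    """Average weight of the document's features; unseen features count 1.0."""
    feats = set(features)
    return sum(weights.get(f, 1.0) for f in feats) / len(feats)


def classify(classes, features):
    return max(classes, key=lambda name: score(classes[name], features))


classes = {"good": new_class(), "spam": new_class()}
train_double_sided(classes, "spam", ["cheap", "pills", "now"])
train_double_sided(classes, "good", ["meeting", "notes", "now"])
print(classify(classes, ["cheap", "pills"]))     # spam
print(classify(classes, ["meeting", "notes"]))   # good
```

With single-sided training only the target class would ever move, so a feature typical of spam would keep its neutral weight of 1.0 in the 'good' class; the refute step is what pushes the two classes apart.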

Important: note that doublesided training deteriorates the performance of all the other Bayesian/Markovian classifiers, so make sure you set that flag properly in mailfilter.cf.

Note: I am considering building an 'auto-sense' into mailtrainer et al, which turns on that doublesided training automatically when you pick WINNOW or ALT.WINNOW as your classifier of choice for your email filtering. That would reduce the risk of incorrectly setting up that (and other) classifiers.

Now that my tests turned this up, I checked vanilla and had this A-ha! Erlebnis (plus Engrish translation) you often get when reading classic UNIX man pages: by the time you've found out yourself, the writing clearly states you should do it precisely like that. Same with crm114, when you check Bill's megatest.sh. My rejoicing in this rediscovery knows no bounds.

Yes. I've fallen so deep that vanilla's megatest.sh is now... documentation.

Other bits 'n pieces:

Known issues