CRM114 for Win32/UNIX - The future holds ...

Here is what we have planned for future CRM114 GerH releases, in terms of both features and bug fixes.

I do not know at this moment if these features will also appear in the mainline - I do hope they will - but that's up to Bill. Meanwhile, I'll move on with the material which has at least one very eager customer waiting for it (that would be me); you may reap the side-benefits from this.

Have fun with this work in progress!

Ger Hobbelt

Note that this space introduces new concepts to a future CRM114; this is not a description of the current state of CRM114, even when some of the bits may at times look like bits of a user manual. Actual CRM114 documentation is included in the downloads: QUICKREF.txt / FAQ / README / etc.

Table of Contents

...TBD...

  1. User Visible

    1. User Manual / [On-line] Help

    2. Classifiers

    3. CSS Database / Classifier Tools

    4. Cross-platform portability and CSS database file use/access

    5. 'transform', 'mutate' and other data transform built-ins

    6. Script Debugging

    7. Variable scoping

    8. Variable classes

  2. Technical

    1. Classifiers

    2. CRM114 script compiler

    3. Re-entrant code

      1. Moving to C++ ?

    4. libcrm114

    5. CRM114 as a server [module]: mod_crm114

    6. Code review

    7. The use of assertions and other defensive programming techniques

The CRM114 compiler / fully reentrant code / libcrm114 / Apache mod_crm114

2008-05-04:

On Sun, May 4, 2008 at 5:05 PM, Bill Y <wsy@merl.com> wrote:
> Ahhh... I looked for it in the download kit; it seems that there's
> version skew between the kit and the "about".

Uhm, you got me there. You mean you saw I didn't update the timestamp in QUICKREF compared to the actual release date? Or??

> I see several things I like and should put into mainline when
> I have some time (as in, after fixing substring compression, the SVM,
> and clumping/pmulc). Then there's the whole question of TF/IDF
> capability and where it fits into the language itself.

TF/IDF = ? (my brain has shut down for today, sorry)

> However- I have a question on the css* commands - some of the commands
> will be exceptionally difficult (practically impossible, actually)
> to implement. Example: cssdiff of two neural networks - I suspect
> the person who figures out how to analyze the actual contents of
> a neural net for nontrivial examples will win some prize from
> the ACM or the AMS. Cssdiff makes sense only when the featuresets

I'll try then. ;-) Nah, seriously though, I just took the collection of utils/tools that come with CRM114 and created a command for each of 'em.
The only 'new' one is cssanalyze which is something *I* want: I don't know yet if it's going to help me or not, but I won't know until I've created the little bugger.

Anyway, the design ideas behind those commands (and the 'mutate' in there):

  1. cssdiff/merge/analyze/...

    Triggered by earlier discussions and the discovery that the tools aren't really 'in sync' with crm114 main, I'd rather have the added diagnostics in there. I know csscreate may be considered a security threat by the more security-minded paranoids out there, but I can cope with that, thanks to adding a few more WITHOUT_XXX compile-time defines to ease their pain.
    Meanwhile, I get a chance to get the analysis code up close and personal with the current state of each classifier, including support for versioning and cross-platform CSS file formats like I have in crm_versioning.c. 'cssupgrade' might be another command, to enable folks to 'upgrade' from older, now incompatible, crm114 CSS versions with the least amount of fuss. (A migrate for 64-bitters is on the far horizon there.)

    For all these commands there is one VERY important adage: if it is not built in, it MUST NOT pass by silently: a nonfatalerror() is the very least I can do to warn you that your script is doing something your crm114 copy won't be able to perform, for whatever reason. This also applies to compile-time DISabled (experimental) classifiers, etc.: if it's not there, crm114 gotta yak. (See the sketch right after this list.)

    So my probable inability to make cssdiff say something nice about two Neural Nets will not be harmful. Instead, I'll congratulate the user for trying and tell him crm114 isn't capable of doing this until an ACM-worthy individual comes along. :-) Meanwhile, [s]he /should/ be able to do something useful with the OSB[F] CSS files there.

    In short: it's all about 'orthogonality': I wish to offer these commands at least for _some_ classifiers, yet _all_ classifiers will have them (that's consistency). Unfortunately, some of them will be 'better' at it than others.

    First thing there of course is to add a measure of support for the 'old' production-level classifiers which most folks use: OSB/F/etc. That means merging crm114 with the cssutils it now comes with.

    Then there's cssanalyze, which I'd like to see dump the original token texts together with the hashes, if the planets are properly aligned. Just an idea up there in the brain which I've got to try out some time. You may and can dispute its usefulness, but let's just say 'cssanalyze' is to be your 'fsck/chkdsk' for CSS files.

  2. invoking tools can be quite costly on some not-too-UNIX-y systems. Besides, I'm more a VMS lover than a UNIX addict from a system design point of view, so that might tell you something too.

    Basically, 'mutate' is there to have 'built-in' (plug-in?) data pre/postprocessing available for 'spam/text/data filtering' - in this case that means both email and news and, well, anything that's written by humans for humans. And maybe even other stuff too. For instance, one could create a 'Metaphone' sequence for each input word, or do 'i18n' tokenization preprocessing (a.k.a. word splitting) - see for instance http://nlp.stanford.edu/IR-book/html/htmledition/tokenization-1.html for an idea of what's going on outside the US in terms of language. China is a big market, and I don't see many non-Asian antispam folks actively researching the situation where a Chinese / Korean / Indian / etc. Asian user would like to get Asian emails and discard others, though both sets of emails are very hard to tell apart for simple Latin-alphabet folks like me who can't read what's in there.

    My guess is an [explicitly] Asia-oriented spamfilter might benefit from a helping hand regarding tokenization - unless we have or find filters which are just as effective on a per-character basis (e.g. Bit Entropy, which is rather language/encoding indifferent as it operates per-bit). The new GerH VT can be made to use a bit/byte/char-wise tokenizer like that: drop it into the VT matrix and you can merge per-bit tokenizing a la Bit Entropy with Markovian token-series matching for similarity detection using OSBF or other stable classifiers. If that works, we won't need international word-splitting. And, no, a byte is not a char in this regard: think UTF-8 encoding, which I find hard to 'match' using regexes alone. Hence the new ability in GerH to plug in custom tokenizers.

    'mutate' is also nice - and that's my first purpose for it - for converting binary formats to something else; as crm114 is sometimes referred to as awk on steroids and amphetamine, think of 'mutate' as 'tr' on the same medication. I'd like to see what 'metaphone' and a few other systems would do to my 'human readable' feeds. Consider it some sort of research. I wonder. I think I can use something like that. I want to know. Thus: 'mutate'.
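
To make the 'must not pass by silently' adage from item 1 concrete (as promised above), here is a minimal C sketch of the guard pattern. The WITHOUT_CSSDIFF define and the handler are illustrative, and this stand-in nonfatalerror() is far simpler than the real reporting hook:

#include <stdio.h>

/* Stand-in for CRM114's real nonfatalerror() reporting hook. */
int nonfatalerror(const char *msg, const char *extra)
{
    fprintf(stderr, "crm114 nonfatal error: %s (%s)\n", msg, extra);
    return -1;    /* nonzero: the statement did not do its job */
}

/* Hypothetical handler for the 'cssdiff' statement: when compiled out,
 * it still yaks at the script author instead of silently doing nothing. */
int crm_expr_cssdiff(void)
{
#ifdef WITHOUT_CSSDIFF
    return nonfatalerror("'cssdiff' is not built into this crm114 binary",
                         "rebuild without -DWITHOUT_CSSDIFF to enable it");
#else
    /* ... the actual diff of two CSS files would go here ... */
    return 0;
#endif
}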

> -----
>
> On the other OTHER hand... this is making me think of resurrection of
> the TPA system - Total Plugin Architecture. This was a way to
> add functionality dynamically and conveniently to a high level
> language.
>
> In any case, thanks. It's good to see significant development
> going on.

You're welcome. The only fear I have is that GerH and mainline are constantly diverging as I go on, which can make merging all the harder to do: some changes are actual bug fixes, others are features. For instance, the language compiler has seen some technically important updates: it now actually validates command <xxx> flags against the table in crm_compiler.c, and coming up next is adding checks for the arg counts in that table too. Arg parsing is moving to the front: JIT is nice, but it has its drawbacks even for very large scripts. Besides, we're already scanning the complete script /twice/ at the start of crm114, and arg parsing is not costly at all, so I'm moving it up front to also allow early error checking using the other items in that beautiful table in crm_compiler.c. I *hate* interpreted languages which only report language errors when the code actually hits that line during execution: it means such languages DEMAND 100% test coverage of all possible branches if you desire good production quality. (I sometimes have little nightmares while awake when someone mentions ASP, for example. Hrrr. I've seen many bad things happen on /live/ production boxes that made me cringe, as people are generally lazy coders and ditto testers. Some never get any better than the proverbial 'intern', yet still get to work on production systems...)
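
For the curious, here is a sketch of what that table-driven flag validation looks like. The table layout and flag lists are illustrative, not the actual crm_compiler.c structures:

#include <stdio.h>
#include <string.h>

typedef struct {
    const char *command;
    const char *allowed_flags[8];    /* NULL-terminated list */
} STMT_TABLE_ENTRY;

static const STMT_TABLE_ENTRY stmt_table[] = {
    { "classify", { "osb", "osbf", "unique", "microgroom", NULL } },
    { "learn",    { "osb", "osbf", "unique", "microgroom", "refute", NULL } },
};

/* Return 0 if 'flag' is legal for 'command', nonzero otherwise; called at
 * compile time, so a bad <flag> is reported before the script ever runs. */
int check_flag(const char *command, const char *flag)
{
    size_t i, j;

    for (i = 0; i < sizeof(stmt_table) / sizeof(stmt_table[0]); i++) {
        if (strcmp(stmt_table[i].command, command) != 0)
            continue;
        for (j = 0; stmt_table[i].allowed_flags[j] != NULL; j++) {
            if (strcmp(stmt_table[i].allowed_flags[j], flag) == 0)
                return 0;
        }
        fprintf(stderr, "compile error: <%s> is not a valid flag for '%s'\n",
                flag, command);
        return 1;
    }
    fprintf(stderr, "compile error: unknown command '%s'\n", command);
    return 1;
}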

The compiler preprocessor is planned to be discarded as a whole - some steps in that direction have already been taken - as that would speed up parsing/compiling large scripts significantly: one less full sweep of the script. Important when your CPU time is not mostly spent in the classifiers.
Another move, enabled by the early check+parse of args, is the ability to 'merge' the APB and MCT structures, which would reduce the heap block count significantly for large scripts. Think of it as doing to the MCT/compiled script what you already did to the 'isolate'd data itself: less malloc, more speed.

And last but not least is the End Goal of all this compiler harassing: a fully re-entrant CRM114 engine. No more bloody globals.
I'm even considering porting the whole thing to C++ for the 'easy way out' as I could make the engine a class and thus have the least amount of worry about the global data structures there.

Why?

Because the End Goal is 'CRM114 server-style', i.e. CRM114 capable of running as a service (Win32) / daemon (UNIX) which accepts input and feeds output back to callers. A bit like spamassassin: daemon + small communication client. There's a little jot somewhere in my notebook of 'far-fetched ideas' about creating a mod_crm114, i.e. an Apache module using libcrm114: I get the excellent server behaviour and HTTP/XXXX communication skills of Apache and the scriptability + classification ability of crm114. Cross-platform, because Apache is available for (almost?) all platforms that might consider using crm114.

Which means 'libcrm114' would be Very Nice To Have(tm). Then I can have my dearly wanted mod_crm114 and thus plug it into my existing communication system: emails and other data feeds can be 'PUT'-uploaded to this server (or you create extra comm commands a la WebDAV).
A precondition for all this is re-entrancy, as crm114 will then run in a multithreaded core. No place for global vars the way we have them now.

User Visible

User Manual / [On-line] Help

Updates to QUICKREF (done for the most part already).

Can we 'revive' or otherwise update the CRM114 book? Copyright, etc.

Classifiers

Add custom tokenizers for binary data in specific formats (32-bit floating point / integer inputs). CRM114 is advertised to be flat 8-bit with no NUL-sentinel dependencies, but that still needs real testing. It would save quite a bit of useless CPU load spent printf()-ing the bloody input and then scanning those texts back into number 'tokens' in the classifiers and elsewhere in the CRM114 scripts.
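
A sketch of the difference; the function names are hypothetical, but the contrast between the current printf()/rescan detour and a format-aware tokenizer handing 4-byte values straight to the classifier should be clear:

#include <stdio.h>
#include <stddef.h>

/* Today's detour: binary -> text -> regex tokenizer -> number 'tokens'. */
size_t floats_to_text(const float *data, size_t n, char *out, size_t outsz)
{
    size_t used = 0, i;

    for (i = 0; i < n && used < outsz; i++)
        used += (size_t)snprintf(out + used, outsz - used, "%g ", data[i]);
    return used;    /* the classifier then re-scans this text for tokens */
}

/* The proposed shortcut: each 32-bit value IS a token already. */
void floats_as_tokens(const float *data, size_t n,
                      void (*emit_token)(const void *tok, size_t len))
{
    size_t i;

    for (i = 0; i < n; i++)
        emit_token(&data[i], sizeof data[i]);   /* flat 8-bit, no NUL sentinel */
}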

CSS Database / Classifier Tools

Initial move done: added built-in script commands for all available CSS utils.

Now 'all we've got to do' is implement them for each classifier; at least for the ones we care about or can manage.

Cross-platform portability and CSS database file use/access

My versioning header is much more than just a version stamp: at a copious 16K words it has room for some other stats, apart from all the little tricks to detect the endianness and word size of the given CSS file; the intent here is to allow CRM114 to detect and cope with (by user warning or otherwise) CSS files that originate from another system.

Think cheap 'server farms' where different architectures share a network and CSS files.
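
A sketch of the kind of detection such a header makes possible: a known multi-byte marker is written at CSS creation time, and a reader on another box compares it against its own native encoding. The field layout and marker value are illustrative, not the actual crm_versioning.c format:

#include <stdint.h>

#define CSS_MARKER 0x0A0B0C0DU    /* hypothetical 32-bit marker value */

typedef struct {
    uint32_t marker;       /* reveals the byte order of the writing host */
    uint32_t word_size;    /* e.g. sizeof(long) on the writing host */
} css_header_probe;

/* Return 0 = native, 1 = byte-swapped, -1 = not a recognizable CSS header. */
int css_detect_endianness(const css_header_probe *h)
{
    if (h->marker == CSS_MARKER)
        return 0;                     /* same byte order as this host */
    if (h->marker == 0x0D0C0B0AU)     /* CSS_MARKER with its bytes reversed */
        return 1;                     /* readable, but needs a swap pass */
    return -1;
}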

Current idea for a 'distributed spam filter system' which should be simple and absolutely safe from file locking issues and related corruption:

Trouble is that network-shared CSS files will get corrupted at some point, because there's no way I can absolutely guarantee that file locking will work properly all the time. That's why database servers exist, and why really smart people spend a lot of time on network locking algorithms and implementations. Since I'm not Oracle and do not want to restrict users to a 'cleared' list of server hardware and software, the problem only gets worse unless I 'go around' it. So:

We copy the CSS files from the 'learn' box to the 'classify' boxes. Do not overwrite.

Then each classify box has a nice little cron job or other device to schedule this sequence: take the classifier off-line; move the new CSS database(s) to the active location, i.e. overwrite the existing ones; check the CSS files and, where needed, fix up the endianness and/or word-size issues in there (hopefully none) using a little CRM script and those css* script commands we've got; when done, take the classify box back on line.

The classify boxes sit behind a load-distributing little network gadget (Cisco? Maybe a Linux box for ultra-cheapness) which will detect the 'being-down-ness' of a classifier machine, so incoming high traffic will be distributed over the other boxes while this baby upgrades.

Next in line/time are the other classifier boxes.

Drawback: 'learn' does not have an immediate effect, as there's a little time delay in there, but I can argue very well that this minor delay is nothing compared to the human intervention delay preceding the learn operation: human response time when 'on the mark' is 0.3 seconds, and you can't get that 24/7 - see chemical and nuclear plants, where human intervention may take - at best - several seconds. Can you imagine hiring someone to watch this system that closely all day, every day? So human response times will be worse than that. What's an extra few (sub?)seconds due to this mechanism, then?

'transform', 'mutate' and other data transform built-ins

Metaphone and other algorithms built in, to transform input 'tr'-style.

An experiment really. Gotta see if it helps some of my feeds.
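
To show the kind of phonetic, 'tr'-like reduction such a built-in could host, here is a minimal Soundex sketch (full Metaphone runs a few hundred lines; this is the same idea in miniature). 'viagra' and 'vaigra' both reduce to V260, so obfuscated spellings feed the classifier the very same token:

#include <ctype.h>
#include <stddef.h>

static char soundex_digit(char c)
{
    switch (toupper((unsigned char)c)) {
    case 'B': case 'F': case 'P': case 'V':              return '1';
    case 'C': case 'G': case 'J': case 'K':
    case 'Q': case 'S': case 'X': case 'Z':              return '2';
    case 'D': case 'T':                                  return '3';
    case 'L':                                            return '4';
    case 'M': case 'N':                                  return '5';
    case 'R':                                            return '6';
    default:                                             return '0';  /* vowels etc. */
    }
}

/* Write the classic 4-character Soundex code of 'word' into out[5]. */
void soundex(const char *word, char out[5])
{
    size_t len = 1;
    char prev;

    if (word[0] == '\0') { out[0] = '\0'; return; }
    out[0] = (char)toupper((unsigned char)word[0]);
    prev = soundex_digit(word[0]);
    for (word++; *word != '\0' && len < 4; word++) {
        char d = soundex_digit(*word);
        if (d != '0' && d != prev)
            out[len++] = d;
        prev = d;          /* a vowel resets the duplicate suppression */
    }
    while (len < 4)
        out[len++] = '0';  /* pad short codes */
    out[4] = '\0';
}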

Script Debugging

This baby is well underway, or maybe even done; everything I need at this moment is in there:

The debugger now also has the ability to run 'macros'. Just hip talk for multi-command debugger command lines, like my favorite niex;v>5 or the even wickeder niex;v>5;. which auto-repeats itself until you hit a snag. Sort of like 'c' but cuter, especially when you disable 't' beforehand because you don't feel like looking at that bit of info too.

Anyway, this is not the future anymore. This future is here.

Variable scoping

I've had several headaches on end because, every time, I had to find out again that CRM114 actually does not have variable scopes: it may have call, but it sure as Hell has no (function-)local variables. I am very tempted to introduce those to CRM114, so I can isolate variables in a function-local scope and not suffer from name clashes with other spots where the same variable name is used for completely different purposes. And - of course - such clashes would happen to happen in callable 'routines' I may use.

Imagine having this 'no scope' feature in C or C++ and then try using a third-party source code library like, say, maillib.crm. Aw, my head!

Variable classes

Not something I came up with myself. See Paolo & Bill's conversation on crm114-developers about how to return multiple results from a classifier, so you wouldn't need extra regex goodness just to get at that pR or other item produced by the classifier.

Looks a bit like class elements, e.g. result.pR, etc.

I like it.

Paolo also provided some Proof of Concept code so we might want to run with that. A properly pulled off heist is in order.

Technical

Classifiers

Unify OSB/F/etc. into a single codebase.

R&D I'd like to do: replace the linear-probe hashing system in there with bucketized cuckoo hashing.

GerH already comes with several experimental hash functions to apply to tokens; continue that and add some proper tests for those routines.

This is aimed at very large data sets, where collisions in the CSS file are becoming a bit of a bother - at least for me. Reported linear-probe chains of 600 and more give me the heebie-jeebies. CRM114 can do better than that.
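
A sketch of the idea: every key has two candidate buckets of a few slots each, so a lookup touches at most 2 * SLOTS cells instead of walking a 600-entry probe chain. Bucket count, hash mixing and the eviction bound here are illustrative:

#include <stdint.h>
#include <stddef.h>

#define NBUCKETS  1024    /* must be a power of two for the mask trick */
#define SLOTS     4       /* entries per bucket ("bucketized") */
#define MAX_KICKS 32      /* give up (grow/rehash) after this many evictions */

typedef struct { uint32_t hash; uint32_t value; } slot_t;
static slot_t table[NBUCKETS][SLOTS];    /* hash == 0 means empty, for brevity */

static uint32_t bucket1(uint32_t h) { return h & (NBUCKETS - 1); }
static uint32_t bucket2(uint32_t h) { return (h * 0x9E3779B1u >> 16) & (NBUCKETS - 1); }

/* Return the slot holding token hash 'h', or NULL: at most 8 cells touched. */
slot_t *cuckoo_find(uint32_t h)
{
    uint32_t b[2] = { bucket1(h), bucket2(h) };
    int i, s;

    for (i = 0; i < 2; i++)
        for (s = 0; s < SLOTS; s++)
            if (table[b[i]][s].hash == h)
                return &table[b[i]][s];
    return NULL;
}

/* Insert; returns 0 on success, -1 when the table needs a grow + rehash. */
int cuckoo_insert(uint32_t h, uint32_t value)
{
    slot_t item = { h, value };
    uint32_t b = bucket1(h);
    int kicks, s;

    for (kicks = 0; kicks < MAX_KICKS; kicks++) {
        for (s = 0; s < SLOTS; s++) {
            if (table[b][s].hash == 0) {    /* free slot: done */
                table[b][s] = item;
                return 0;
            }
        }
        /* Bucket full: evict an occupant and push it to its other bucket. */
        slot_t victim = table[b][0];
        table[b][0] = item;
        item = victim;
        b = (b == bucket1(item.hash)) ? bucket2(item.hash) : bucket1(item.hash);
    }
    return -1;    /* probable cycle or overfull table: caller must rehash */
}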

CRM114 script compiler

Merge the preprocessor and compiler phases into one. The goal: reduce the number of full scans of the script; those are costly for large scripts.

There's a <cracks knuckles at compiler tech> ego factor in there, though. Ah well.

Improve script code error checking up front - I want to know what I coded wrong and where before it even starts to run. Yes, I like my compiled languages, or at least those languages which receive a full syntax check before execution commences. Takes care of a lot of avoidable bugs that way.

And about the 'where': currently CRM114 first preprocesses the script code, including imports, and does not keep track of the actual source file line positions while doing so, so script errors always come with line numbers which are, to put it mildly, a bit off.
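
The fix is plain bookkeeping. While the preprocessor flattens included files into one buffer, it records which (file, line) pair each flattened line came from, so a later error can be mapped back to its true origin. The structure and names here are illustrative:

#include <stddef.h>

typedef struct {
    int         flat_line;    /* line number in the preprocessed buffer */
    const char *src_file;     /* the file this line originally came from */
    int         src_line;     /* line number within that source file */
} line_map_entry;

typedef struct {
    line_map_entry *entries;  /* appended in order during preprocessing */
    size_t          count;
} line_map;

/* Map a line in the flattened script back to its true origin; entries are
 * sorted, so a binary search would do - a linear scan keeps the sketch short. */
const line_map_entry *map_line(const line_map *m, int flat_line)
{
    size_t i;

    for (i = 0; i < m->count; i++)
        if (m->entries[i].flat_line == flat_line)
            return &m->entries[i];
    return NULL;
}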

Re-entrant code

The number of global variables is worrying me, given my future plans for crm114: multithreaded library code to be used in servers.

And don't get me started on fork()ing as a 'solution' for this.

That means the globals should be merged into a little 'crm114 execution' structure of some sort, which is then fed to/accessed by the code in a re-entrant way. No fancy footwork with TLS (Thread Local Storage) and some such, puh-leeze.
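
A sketch of that structure; the field and function names are illustrative, but the principle is exactly this: every ex-global moves into the struct, every function takes a pointer to it, and two threads with two states never touch shared data:

typedef struct crm_exec_state {
    char *script;            /* the compiled script buffer */
    void *var_hash_table;    /* isolated/captured variable storage */
    int   current_stmt;      /* instruction pointer */
    int   status_code;       /* last command's status, per engine instance */
    /* ... every other ex-global lives here ... */
} crm_exec_state;

/* Re-entrant: no file-scope state, only what the caller hands us. */
int crm_run_statement(crm_exec_state *st)
{
    st->current_stmt++;      /* was: an update of a global, csl->cstmt style */
    return st->status_code;
}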

Moving to C++ ?

Given the above, the idea hit me that, if I moved the code to a rudimentary form of C++, I'd have the globals as class data members and the code as class methods, and the whole shebang would be re-entrant to boot. If done right.

C++ is available on all the target systems today, so as long as I don't go for the added bonus of exception handling (which is at times frowned upon when you want to go embedded), keep the number of virtual methods down (the new VT custom tokenizer code is just begging for some), and stay off the templates, it might even be sellable to Bill. Right. In another universe, probably?

libcrm114

That's where I want CRM114 to go. Just another way of saying: the command-line interface separated from the core (script engine + classifiers; I don't want nor need to separate script and classifiers - besides, that last bit would take all the cuddliness out of CRM114, and it deserves some cuddliness). It's why I selected it for my own projects in the first place: good results, compiled code (no ASP or buggeritt like that), and definitely no perl5 cruft - I have a company policy which prohibits the use of perl. We're rather emotional about this, yes. Though I might find the reason for that again if I was willing to go and look.
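
A hypothetical shape for that libcrm114 API - none of these names exist yet; this only illustrates the separation of engine and front-end, with all state behind an opaque handle:

#include <stddef.h>

typedef struct crm_engine crm_engine;    /* opaque: all state lives inside */

crm_engine *crm114_create(void);
int         crm114_compile(crm_engine *e, const char *script, size_t len);
int         crm114_run(crm_engine *e,
                       const char *input, size_t in_len,    /* data to filter */
                       char *output, size_t *out_len);      /* script output */
void        crm114_destroy(crm_engine *e);

The command-line crm114 then becomes just one small caller among many; an Apache worker would hold one engine per request or per pool slot, which is exactly why the re-entrancy work above has to come first.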

Anyway, libcrm114 is the stepping stone for ...

CRM114 as a server [module]: mod_crm114

Exactly!

spamassassin is a good idea; it's just the perl in there. Yech. So it's mod_crm114 time.

What's mod_crm114, you ask?

mod_crm114 is an Apache2 module. To be. Maybe a bit like WebDAV or SVN: Apache2 does the servery stuff it does so amazingly well - and did I mention cross-platform too!? - so we just have our CRM114 hooked in there as a plugin which can handle HTTP-like requests.

That comes with a little bit of a client which just passes data on from anywhere (procmail / etc.) to the CRM server; the server munches the data and classifies, learns, or does whatever the local script patron (script selected through the URL, oh yummmmmm) decides, and the result is spat back across the line to the client so it can continue on its merry way.

Now that is what I'd call 'web services'. <rant>Not the hip stuff, which is just RPC packaged in XML - correction: readable function names and parameters in a text format, fed across the line to a processing system which converts those names into calls (a dispatcher). So 1960's. Yet it's all the buzz, because it's got XML. Where's performance gone? Right, just like the teacher said in CS class: if it's too slow, buy bigger hardware. Nowadays we go for bigger farms, distributed even, because single iron ain't hot enough no more.</rant> Oh shoot, <rant> isn't an HTML tag...

Benefits:

Drawbacks:

Code review

We're at it, we're at it. Whenever I run into something, I don't fix just that spot, but scan the complete codebase for similar buggers (and I'll keep my fingers crossed while stating this). It's just that only a few parts of the code (the compiler, a few classifiers) have received my undivided attention for a longer period of time.

Just reminding myself that I should rather list this as:

Describe what CRM114 does in technical terms. User documentation for the technology-savvy

... for those who need to apply CRM114 to non-email-spam problems and need a bit more info than It Works, Hurray!™

The use of assertions and other defensive programming techniques

Well, not the future of CRM114 GerH as it's already there. CRM_ASSERT() and paranoid error checking are a daily exercise here at the Hobbelt Manor.

The future in there is rather:

When can we finally say it's a stable release? Test suites, anyone? Currently neither

make check 

nor

make megatest 

comes with a mailreaver / mailtrainer / mailfilter test set, but that will probably change in the near future, as I'm working on one.

On ASSERT / VERIFY:

Despite Bill Yerazunis' aversion to ASSERT/VERIFY checks, I'd rather have them than none at all. I've met too many developers who have plenty of excuses not to use them (and, yes, I've been one of those developers too, for too long). And, yes, they're only a partial and inadequate answer to a problem aggravated by the trade-off between run-time execution speed and run-time checking, but in the GerH builds the choice between retentive checking and utter run-time performance is placed in the hands of the [source distribution] user.

As argued over the years by several well-known people in magazines like C/C++ Users Journal and Dr. Dobb's and in other public fora (books, the internet), the basic assert() is not the pinnacle and end-all of code execution validation: in these and other developer magazines, several proposals/examples have been published to make developers throughout the world aware of the possibility of customizing those macros to suit their needs. The CRM_ASSERT() macro is one such customization, tailored to fit the needs of code execution validation in the CRM114 context. I do not see any professional reason not to use them.
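
For illustration, this is the general pattern such a customized macro follows (not the literal CRM_ASSERT() definition): unlike the bare assert(), it can report file/line/expression through the host's own error channel and choose between aborting, trapping, or merely logging:

#include <stdio.h>
#include <stdlib.h>

#ifdef NDEBUG
#define MY_ASSERT(expr)   ((void)0)    /* compiled out for raw speed */
#else
#define MY_ASSERT(expr) \
    do { \
        if (!(expr)) { \
            fprintf(stderr, "assertion '%s' failed at %s:%d\n", \
                    #expr, __FILE__, __LINE__); \
            abort();    /* or: raise a trappable script fault instead */ \
        } \
    } while (0)
#endif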

See the GerH source code for how to disable these macros if you so desire: they are enabled in 'release builds' by default. Also see ./configure --enable-trappable-assertions for an easy way to convert these 'developer checks' into full-fledged, script-controllable run-time checks, for run-time environments where CRM114 scripts should have more control over the fault reports generated by any of these macros. Personally, I advise against this practice, because CRM_ASSERT() is used to check basic and complex software run-time state premises which are not easily remedied by scripted traps when they are found to fail.

Despite this difference in detail between Bill's opinion and mine, the fact remains that however you wish to use/compile them, it is to be your choice and yours alone.

GerH releases offer you both (and other) ways to cope with this.