Here is the RFC as posted by Bill Yerazunis on the mailing list, together with my comments. This page also includes use case and design diagrams plus rapid prototype source code to underscore the validity of these comments.
Note that this space shows the state of the art as of 1 October 2008. The rapid prototype has not yet been started, due to delays caused by other projects which took precedence (mailreaver/mailtrainer review and addition of non-cached learn and train support for those files).
Bill is planning to refactor/redesign(?) current crm114 into a crm114 'shell application' and a 'core' libcrm114 library which offers all classifiers for use in crm114 and other applications.
The original RFC in OpenOffice Writer and Acrobat PDF formats (mandatory as the comments reference this):
I've been pondering for a long time how to approach this, as the original libcrm114 idea is a very good idea indeed, but the RFC begs for illustrated comments, which cannot easily be communicated through limited email (no attachments beyond ~ 40KB), so other means of communication were sought. The wiki would be one channel, but I don't have an edit account there, so I resort to my own trusty web site here.
I hope to direct execution of the libcrm114 idea towards a sufficiently flexible approach which permits the use of a great product (libcrm114) in both high grade 'production environments' and the 'research environment' where it originated.
I strongly believe libcrm114 can be very useful to a lot of users, when both design and implementation adhere to high quality standards. And to make it work out that way, the most important thing we need is a thorough analysis of the needs and requirements for libcrm114 in both a professional world and the research environment. To create an excellent library interface (API) one needs to realize which uses such a library should support and where we should be flexible in our behaviour.
A shot across the bow before we really take off here: the RFC is lacking any type of 'user analysis' (no 'user scenarios' (~ 'use cases')) and without knowing (and reminding ourselves) why we want to do this, there is a high risk of going astray, losing precious resources/effort and delivering a mediocre product at the end.
As the stated 'primary goal' of the RFC is only a 'recode', we must ask ourselves why, not in a technical sense, but in a usage sense. One lesson I was taught early, and often reminded of, during my software engineering career is the absolute need to go back to the 'zero point' (your proverbial 'drawing board') when you detect the need for a change in order to overcome your current trouble.
The current foible is 64-bit platform support, which can be resolved by several judicious tugs and tweaks of the current crm114 source code, plus some serious stamping on it as well. Suffice to say that a large part of this effort is available through the GerH builds, which compile and run on both 32- and 64-bit platforms. Hence:
What trouble? No trouble!
Ah! But there is more! If it were a mere bitsize issue the above would solve that.
Why does Bill want to recode crm114 into libcrm114 then?
Not just for the recode -- do we have that much copious spare time, Bill?
Therefore, let me try another take on the need for libcrm114...
This may sound like a good question (our kids ask 'why' continuously, so there must be something useful in that question after all, n'est-ce pas?), but to start from the 'zero point' it is the wrong question. Then what is the right question?
Not sure yet, but this will get us closer:
Ah! Is the answer here: 'to do a recode'? Or 'to provide 32/64-bit cross platform portability (and maybe even usability)'? No. Sounds an awful lot like 'just for the heck of it' to me, and my time is way too precious for that, yet unconsciously I totally agree that libcrm114 is a good idea. Because it does solve a problem.
Several, I believe.
But they can all be traced back to a single issue at management level (don't we always end up there when the shit hits the fan?):
libcrm114 makes us focus on the crm114 core business values again ... and offloads the cute bits so people are not obstructed in efficiently using those core values any more.
Too much corporate vibes in that for you? Here's the technical lowdown instead. First:
crm114 core business values are ...
In other words: crm114 classifiers are the answer of choice when you are faced with a tough statistical filtering problem, e.g. spam filtering or content categorization.
libcrm114 is meant to 'go back to these core values': statistics classifiers without any obstructions.
Yes, the crm114 script language is part of the cute bits: I like it a lot, and there is true beauty in it, but that is 'liking' as in liking art such as Rothko's: you've got to see it 'in the flesh' to realize it is extraordinary. But you can only appreciate art like that once you have accomplished your lower goals. A la crm114: I can appreciate the language once I've got my filters working.
crm114 brings in a lot of overhead (which you cannot discard) when used in high performance professional server environments:
(c) Copyright 2001-2009, Gerrit E.G. Hobbelt (Ger Hobbelt a.k.a. [i_a] ) - Hebbut.Net