CRM114 for Win32/UNIX - The GerH VT code

New stuff being served...

Here is the write-up (rudimentary) of the concept upon which the rewritten VT code has been based.

Ger Hobbelt

Note that this space shows the state of the art as per 18 september 2008. The source code files are NOT updated after that and stand on their own; their value is purely educational. Fetch a GerH build if you like to have this stuff in a usable form.

Table of Contents

  1. The basic idea that drove the rewrite

    1. The parts that make VT
  2. The implementation
    1. Setup & Cleanup
    2. The Tokenizer
      1. The default tokenizer() method
      2. OO tokenize() method versus 'C' obj->tokenizer() function
    3. The Mixer
    4. Performance considerations: burst mode tokenization and mixing
  3. In closing...


The last few CRM114 releases by Bill Yerazunis contain VT (Vector Tokenizer) support for several classifiers. This text does not concern itself with the benefits of having VT available, but focuses on the rewrite by me and the technical design background.

NOTE: this is not a finished job! There's still a few little ha's and hm's in the code that I want to look at, most particularly thread-safe error reporting. In short, passing all the relevant globals around as function arguments (vht, tdw) is simply exchanging one mess for another; these should be packaged in a suitable container, if only to reduce the number of function arguments, in order to improve the call interface readability.

The files

Compressed download:

New stuff being served...

Download the 7zip archive or the zip archive with all the necessary goodies for reading the code: crm114.h (stripped down) and crm_vector_tokenizer.c.

The basic idea that drove the rewrite

... is that I wanted to have a VT implementation which would be able to pick any user-specified hash algorithm at run time, and the ability to specify a /vector: .../ argument which supports interleaving for OSB et al.

'Flexibility' is key here, as I wanted something that would both be stable, robust and performing in production environments, while not restricting me when I want to test a new (hash-related) idea by forcing me to have to 'hack' the code.

The parts that make VT

The VT is a construct consisting of a few building blocks:

The VT output is the hash series produced by the mixer unit, which thus enables us to use hash values (a.k.a. 'features') which represent any desirable word sequences (Markov chains, etc.) instead of just simple words (unigrams). Quite a few tests have shown that using such 'combined' hash features produces much more accurate results than when you force the classifier (statistical filter) to glean meaning from individual word features without including any context (unigrams).

The implementation

The whole thing pivots around a 'class' definition written in portable C. Given the above mentioned set of basic blocks, it makes sense to treat each separately:

Setup & Cleanup

Since we have to provide sensible default values and functionality, yet wish to allow the user of the VT code to provide his/her own setup bits and pieces at will, the setup and cleanup code should be parceled off to separate functions; when regarding the above from a more Object Oriented way, rather than a Procedural approach, setup and cleanup can be considered something akin to a constructor and destructor:

    config_vt_tokenizer(...); // constructor
    virtual ~cleanup(); // d    virtual ~cleanup(); // destructor

Though 'C' does not offer such OO features, we can emulate these quite easily, using function pointers for virtual methods and offering basic functions as constructor equivalents: hence this turns into something akin to this in 'C':

// constructor: inits class instance 'dst'
// Return 0 on success, negative return values on error.
int config_vt_tokenizer(VT_USERDEF_TOKENIZER *dst, ...);   

    // destructor: call this when done; may be used to free() data when applicable.
    VT_tokenizer_cleanup_func *cleanup; 

Which requires we provide at least a default 'destructor', which, of course, is configured for the object instance by that constructor function. Our default destructor is:

static int default_regex_VT_tokenizer_cleanup_func(VT_USERDEF_TOKENIZER *obj);

Indeed, this can be a 'static' function as no-one needs access to it directly but the core functions working on the VT_USERDEF_TOKENIZER objects, and all those functions reside in the crm_vector_tokenizer.c source file.

Also note that we provide a type to fix the function/method interface, similar to the virtual function definition you would see in a language such as C++:

typedef int VT_tokenizer_cleanup_func (struct magical_VT_userdef_tokenizer *obj); // 'this' reference

where 'obj' is the object pointer of the instance that needs cleanup, which is exactly like the hidden 'this' reference your compiler would use when you'ld coded this in C++, Java or any other object oriented language.

Despite these, right now, this 'class' still does not perform any useful function, but we have our first building block: Setup & Cleanup, covered.

The Tokenizer

As our goal was maximum flexibility regarding tokenization and hashing, we can easily see why this should be a virtual method, resulting in a class definition like this:

    config_vt_tokenizer(...); // constructor

    virtual tokenize(input);

    virtual ~cleanup(); // destructor

which is translated into this 'C' code for the class definition:

    VT_tokenizer_func *tokenizer; // custom or default tokenize()

    // destructor: call this when done; may be used to free() data when applicable.
    VT_tokenizer_cleanup_func *cleanup; 
} VT_USERDEF_TOKENIZER;                               int tokenhash_dest_size);

where 'obj' is the 'this' reference to the VT_USERDEF_TOKENIZER instance and 'tokenhash_dest' is an output buffer of length 'tokenhash_dest_size' elements for storing the hashes generated by your tokenizer implementation.

Two things that need to be noted here:

Why did I construct the interface definition this way? Because:

  1. The source of the data to tokenize is unknown: it can be a file, a buffer in memory, etc.: this multiple choice in input types leads to the choice to specify the source/input data at the time the class instance is constructed, i.e. when we call the constructor to set up this instance; these parameters are then 'remembered' by storing them in the instance data:

    typedef struct magical_VT_userdef_tokenizer
        const char *input_text;                 // used to remember the input text to tokenize ...
        int         input_textlen;              // ... and its length in bytes
        int         input_next_offset;          // position where to continue grabbing the next token
        VT_tokenizer_func *tokenizer;           // tokenize()
        VT_tokenizer_cleanup_func *cleanup;     // virtual ~destructor()

    despite the fact that 'C' does not offer OO class inheritance features, we could extend the class/struct when needed; for now, the choice to do it this way is not really significant, though it does cut down the number of method interface parameters (and thus stack usage).

  2. The desired output of the tokenizer is well known and not subject to change: a series of hash features, one hash per word. As this result is to be piped directly into the next stage (the Mixer), not having that intermediate result buffered in the 'obj' instance itself is preferable, hence the extra output parameters for the tokenizer() method.

The default tokenizer() method

The default tokenizer() method, as configured by the constructor, is the function:

static int default_regex_VT_tokenizer_func(VT_USERDEF_TOKENIZER *obj,
        crmhash_t                                               *tokenhash_dest,
        int                                                      tokenhash_dest_size)

which handles both default CRM114 tokenization ([:graph:]+) and user-specified regex-based tokenization, just like the original code.

Note that this is another static function, as it is only to be used by the contructor code and otherwise always invoked through the obj->tokenizer() function pointer.

Note that this 'default tokenizer' also offers the extra feature, otherwise only found in the old OSBF code, where very large multi-line tokens (which are very probably MIME/UUenc encoded binary blobs, such as images, in email or NNTP news) are globbed into a single hash to prevent database polution for this type of entity.

Here, I divert from the original code, as the OSBF original code would treat the blob line sequence AB as identical to BA; the new default tokenizer ensures AB and BA blobs will produce two different hashes.

Of course, this feature can be turned on or off and otherwise configured: see the values in the class/struct definition:

int max_big_token_count;  // maximum number of 'big' tokens allowed to be merged into a single 'feature' a la OSBF
int max_token_length;     /* merge tokens longer than this with the next one into a single feature a la OSBF.
                           * Set to 0 to select the DEFAULT 'big token merge' tokenizer behaviour.
                           * Set to -1 to DISABLE any 'big token merge' tokenizer behaviour.
                           * Setting this value to a non-zero value helps 'gobble' otherwise undecipherable blocks
                           * of input such as base64 images. This 'merging/globbing' is done in the OSBF classifier
                           * among others.

OO tokenize() method versus 'C' obj->tokenizer() function

In C++, Java, etc. you can call a class/instance method like this:


which could be done in 'C' as well, given the fact that tokenizer() is a function pointer:

obj->tokenizer(obj, ...)

and that's where we have been rather unclean in this particular implementation to keep the appearance of backwards VT interface compatibility intact as much as possible: we still offer the old crm_vector_tokenize_selector() function call as the major interface for performing the tokenize and mixing operations. Internally though, crm_vector_tokenize_selector() does call obj->tokenizer() as one would expect, though this is off-loaded to another function called crm_vector_tokenize() to keep the code lean.

As such, our implementation assumes the 'public' interface is the procedural crm_vector_tokenize_selector() function call, rather than the object oriented obj->tokenizer(obj, ...) method invocation.

As we wish to keep the impact on the surrounding code to an absolute minimum, we have packed all four stages: Setup, Tokenization, Mixing and Cleanup, inside that single old crm_vector_tokenize_selector() call to ensure the ensure the calling code does not need to change beyond the removal of a few parameters, compared to what it was.

Hence the new crm_vector_tokenize_selector() interface reads like this:

int crm_vector_tokenize_selector(ARGPARSE_BLOCK *apb, // The args for this line of code
        VHT_CELL                               **vht,
        CSL_CELL                                *tdw,
        const char                              *text,           // input string (null-safe!)
        int                                      textlen,        //   how many bytes of input.
        int                                      start_offset,   //     start tokenizing at this byte.
        VT_USERDEF_TOKENIZER                    *user_tokenizer, // the parsing regex (might be ignored)
        VT_USERDEF_COEFF_MATRIX                 *user_coeff,     // the pipeline coefficient control array, etc.
        crmhash_t                               *features,       // where the output features go
        int                                      featureslen,    //   how many output features (max)
        int                                     *features_out    // how many feature-slots did we actually use up

which is several arguments less than the original, while only adding two new ones (in bold). Here we decided, for minimum impact on the calling code, that both 'tokenizer' and 'our_coeff' instances may be NULL (which they are throughout CRM114 right now), which will instruct crm_vector_tokenize() to create and fill in the defaults for both. This 'defaulting' includes the detection and processing of custom /vector:.../ coefficient matrices in the current crm114 script instruction (which is accessed through the usual globals, just like before. This is definitely not threadsafe, so not good enough for when libcrm114 arrives on the scene.

In order to handle the complexities of recognizing and decoding any possible custom /vector:.../ coefficient matrix, we have provided an additional constructor function config_vt_coeff_matrix_and_tokenizer(), which handles this before calling the default constructor config_vt_tokenizer(), filling in the (already decoded) custom coefficient matrix as an argument.

After this digression we return to the third process unit of the VT system:

The Mixer

As we have already achieved our desired flexibility in the tokenizer, and we also have any default or custom coefficient matrices available in the class instance, thanks to the decoding effort in the constructor (and its subfunction get_vt_vector_from_2nd_slasharg() and partners), this one can be rather static: there's no need for a virtual method for this one as the mixing stage is well defined and I don't see why I or others would like to change this core functionality any time soon. Hence we arrive at this class definition:

    config_vt_tokenizer(...); // constructor

    virtual tokenize(input);

    mixer(&output); // non-virtual method

    virtual ~cleanup(); // destructor

which translates to 'C' as 'mixer()' just being another function call which gets the 'this' (obj) instance pointer passed.

Here we've chosen not to provide an extra function, but integrate this important bit of functionality in the already present crm_vector_tokenize() function.

Performance considerations: burst mode tokenization and mixing

Performance is a factor, so the code should be as fast as possible.

The performance hit for using function pointers, compared to regular function calls, is negligible: the compiler only needs to one indirection (fetch function jump address) which is often only a single, fast, opcode on most CPUs (and at least on all Intel x86 hardware). I'm willing to bet pushing an extra 3 or 4 parameters on the stack per function call is more costly, so tracking all relevant input variables in the 'class' structure is not a bad choice - even while this implies a few more indirections in other bits of the VT code are required that cost a few extra cycles. However, these extra indirections do not happen within often executed loops; when they would, the intermediate results are stored in fast local variables.

Anyhow, if there's any relative big win to gain, it's most probably to be found in data and code cache locality improvements; to facilitate code cache locality at least, we have built in 'burst mode' processing of the data: the input is tokenized and hashed in 256-hash bursts (see the CRM_VT_BURST_SIZE #define); each burst means the obj->tokenizer() is only fired once for every burst, after which the produced hashes are mixed in burst mode as well.

To make that easier for use, we have reversed the order of the intermediate hashes, compared to the original code by Bill Yerazunis: newer items end up at higher indexes in the buffer. This means that for the coefficient matrix-based mixing to work, we will need to address the 'older' context features, to be mixed with the current tokenized feature, using negative indexes (as 'older' means 'lower index in the buffer'); this is a prime example of a valid use of 'C' ability to use negative indexes to arrays without any run-time penalty.

Compare the stripped-down old code versus the new. Old:

while (keepgoing)
    // Now slide the hashpipe up one slot, and stuff this new token
    // into the front of the pipeline
    memmove (& hashpipe[1], hashpipe, sizeof (hashpipe) - sizeof (hashpipe[0]));
    hashpipe[0] = strnhash(token.string);
    // Now, for each row in the coefficient array, we create a
    // feature.
    for (irow = 0; irow < pipe_iters; irow++)
        ihash = 0;
        for (icol = 0; icol < pipe_len; icol++)
          ihash += hashpipe[icol] * coeff_array[(pipe_len * irow) + icol];
          // Stuff the final ihash value into features array
          features[*features_out] = ihash;
          *features_out += features_stride;
        // And finally move on to the next place in the input.
        // Move to end of current token.
        text_offset += token.string.length;

and new GerH burst mode code:

Note the two major items here:

for (feature_pos = 0; ;)
    // cough up a fresh token burst now and put them at the front of the pipe, i.e. at the END of the array:
    i = tokenizer->tokenizer(tokenizer, &hashpipe[UNIFIED_WINDOW_LEN - 1], CRM_VT_BURST_SIZE);

    // position the 'real' hashpipe at the proper spot in the 'burst': the oldest token is the first one in the burst
    hp = &hashpipe[UNIFIED_WINDOW_LEN - 1];

    // Since we loaded the tokens in burst, we can now speed up the second part of the process due to
    // loop tightening here too; since the 'reversed' pipeline stores newer tokens at higher indexes,
    // walking the sliding window (a.k.a. 'hp' pipeline) UP from old to new, means we can reposition the
    // 'hp' pipeline reference to point at the NEXT token produced during the burst tokenization before,
    // while keeping track how many tokens were produced by reducing the burst length 'i' --> i--, hp++
    for ( ; i > 0; i--, hp++)
        // remember how far we filled the features bucket already;
        // we're going to interleave another block:
        int basepos = feature_pos;

        coeff_array = our_coeff->coeff_array;

        // select the proper 2D matrix plane from the 3D matrix cube on every round:
        // one full round per available coefficient matrix
        for (matrix = 0;
             matrix < pipe_matrices;
             matrix++, coeff_array += pipe_iters * pipe_len)
            // reset the index for interleaved feature storage:
            feature_pos = basepos + matrix;

            // Features for the different 'strides' are stored in interleaved format, e.g.
            // for stride=2 Markovian:
            //   A
            //   B
            //   A
            //   B
            //   A
            //   B
            //   .
            //   .
            //   .

            //    Now, for each row in the coefficient array, we create a
            //   feature.
            for (irow = 0; irow < pipe_iters; irow++)
                register const int *ca = &coeff_array[pipe_len * irow];
                register crmhash_t ihash = 0;

                // WARNING: remember that we use a 'reversed' pipeline here, so higher indexes are newer;
                // as we have positive offsets in ca[] pointing at presumably higher=OLDER tokens, we must
                // negate them (index * -1) to point at the intended, OLDER, token in the hashpipe pointed at
                // by 'hp' -- a valid case for 'negative indexing' in what seems to be an array but is a
                // positional pointer ('hp') pointing at the storage END of the pipeline.
                for (icol = 0; icol < pipe_len; icol++)
                    register crmhash_t universal_multiplier = ca[icol];

                    // the - minus 'icol' is intentional: reverse d pipeline: older is lower index.
                    ihash += hp[-icol] * universal_multiplier;

                //    Stuff the final ihash value into features array -- iff we still got room for it
                features_buffer[feature_pos] = ihash;

                feature_pos += element_step_size;
        // and correct for the interleaving we just did:
        feature_pos -= element_step_size - 1;     // feature_pos points 'element_step_size' elems PAST the last write; should be only +1(ONE)


The 'register' modifiers in there are just an extra reminder to the compiler optimizer that I really would like those inner loops to only use register-based variables. The smartest optimizers don't need the hint, but it doesn't hurt either - if you know what you are doing.

In closing...

The above is a write-up, done in a few hours, to clarify the [technical] design thoughts behind and the actual implementation of those ideas in the new GerH VT code.

You can implement rather clean and robust code using some basic OO-like techniques while programming in the [procedural] 'C' language: function pointers have an extremely low overhead and can, when used properly, mimic virtual member functions; furthermore, the separation between data and function does not have to be, even when programming in a language which is largely considered to be procedural in nature. Surely, it requires a bit of premeditation, but exercising your brain before grabbing a keyboard never hurts.

Of course, this approach is akin to being your own Poor Man's cfront, but I like what can be done this way in such an 'old' language, while keeping readable code.

As an added bonus to frustrate the Code Calvinists out there, here's a prime example of why negative array indexing is Good, instead of Evil.