CRM114 for Win32/UNIX - VT 2007 (Vector Tokenizer)

This comes (almost) verbatim from an email by Bill Yerazunis on the crm114 mailing list.

Note that there are some mathematical details in there which I disagree with; see the notes at the bottom.

VT 2007: Vector Tokenization

(from an email on the CRM114 mailing list)

Summary

Vector tokenization (VT) is a generalization of the MRF / OSB / OSBF system: it lets you specify how regexed tokens are combined into features, so you can control how features are assembled without having to hack the C code. It's not a new idea (see "A Unified Approach to Spam Filtration").

By default, with VT you get the standard tokenization we've come to know and love. But you can now switch tokenizations independently of the classifier algorithm, just as the /slashregex/ lets you specify tokenization on the fly.

The VT code auto-defaults depending on the classifier: if you don't specify anything different, you get what you got before (modulo edge effects on how the first and last few tokens in a text are treated).

But it also means that any VTed classifier can use any other token patterns, like

UNIGRAM

one token equals one feature.

STRING

four tokens in a row equal one feature; the default regex here is /./ (that is, one byte = one token)

You can even specify the vectors that translate token streams into feature streams, using the "vector:" command in the second slash parameter of the classify.

Read on further if you want gory details.

Bill Yerazunis

The Gory Details - Skip This if you Don't Care

The second-slash format for vector tokenization is

    / vector: PipeLen PipeIters val1 val2 val3 val4 .... /

"Pipelen" is an integer to specify how many tokens in a row might be included in the token-to-feature translation.

"PipeIters" is an integer to specify how many different such weightings are in the vector.

val1 through valN are the actual pipeline coding values. One pipeline set is PipeLen integers, and there need to be PipeIters such sets (if you don't have enough, the missing ones are taken as zeroes, and if you have too many, the excess ones are ignored).

It is suggested but not required that the pipe valN values be small odd prime integers.
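
To make this concrete, here is a rough Python sketch of what the transform amounts to. The token_hash function and the exact mixing are hypothetical stand-ins of mine, not CRM114's actual hash pipeline (that lives in the C source), but the pipeline/weight-row structure is as described above:

  # Rough sketch of vector tokenization. token_hash and the mixing are
  # hypothetical stand-ins, NOT CRM114's actual hashing.
  def token_hash(token):
      # Stand-in 32-bit hash of one regexed token.
      h = 5381
      for ch in token:
          h = (h * 33 + ord(ch)) & 0xFFFFFFFF
      return h

  def vector_tokenize(tokens, pipelen, rows):
      # Slide a PipeLen-long pipeline over the token stream; at each
      # position, emit one feature hash per weight row (PipeIters rows).
      # row[0] multiplies the most recent token; later columns point back
      # into history (see the note at the bottom of this page).
      hashes = [token_hash(t) for t in tokens]
      features = []
      for i in range(pipelen - 1, len(hashes)):
          window = hashes[i - pipelen + 1 : i + 1][::-1]  # newest first
          for row in rows:
              f = 0
              for mult, th in zip(row, window):  # short rows act as zeroes
                  f = (f + mult * th) & 0xFFFFFFFF  # multiply-and-sum
              features.append(f)
      return features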

Some examples (extra spaces are disregarded; I'm adding them to make it visually more intuitive):

  vector: 1   1   1  

means the pipeline is 1 regexed token long, there is only one feature to be generated per pipeline position, and that feature is to be treated uniquely (kinda redundant, I know). This is what we would normally call a UNIGRAM feature, as one regexed token becomes one feature.
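
In terms of the hypothetical sketch above, that is simply:

  # vector: 1 1 1 -- one feature per token
  vector_tokenize("foo bar baz".split(), 1, [[1]])  # yields three features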

So we wanted pairs

Say we wanted word pairs (something you can't get in current CRM114 without using preprocessing). You could do that with:

  vector: 2  1   1 2  

Here, the pipe is 2 tokens long, there is 1 feature per pipeline position, and the one output feature takes pipe position 1 and pipe position 2 as distinct positions (that is, if the words of a pair appear in a different order, then it's a different feature).
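
With the multiply-and-sum scheme sketched earlier (my stand-in, not CRM114's exact mixing), the order-sensitivity is visible directly:

  # vector: 2 1 1 2 -- weights 1 and 2 keep the two positions distinct:
  #   "foo bar" -> (1*h(bar) + 2*h(foo)) mod 2**32   (newest token first)
  #   "bar foo" -> (1*h(foo) + 2*h(bar)) mod 2**32
  # Different weighted sums, hence (in general) different features.
  assert vector_tokenize("foo bar".split(), 2, [[1, 2]]) != \
         vector_tokenize("bar foo".split(), 2, [[1, 2]])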

Sharing word positions: no order, no discipline

Now, let's try sharing word positions. Here's word pairs where order is disregarded; "foo bar" is the same feature as "bar foo":

  vector: 2  1   1 1  

that is, a 2-long pipe, 1 feature generated from each pipeline position, both positions count, and the position values are not differentiated.
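
Again in terms of the sketch: with equal weights the sum is symmetric, so the order drops out:

  # vector: 2 1 1 1 -- equal weights make the sum order-independent:
  #   "foo bar" -> (1*h(bar) + 1*h(foo)) mod 2**32
  #   "bar foo" -> (1*h(foo) + 1*h(bar)) mod 2**32  (same sum)
  assert vector_tokenize("foo bar".split(), 2, [[1, 1]]) == \
         vector_tokenize("bar foo".split(), 2, [[1, 1]])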

Sparse Bigrams: The Matrix Is Born

Here's one for the first two terms of sparse bigrams:

  vector: 3  2    1 2 0   1 0 3  

This makes a 3 token long pipeline; there are two features to be generated at each pipeline position. The first feature (the "1 2 0") uses position 1 and position 2 (and keeps them discriminated; switching 1 and 2 around gives a different feature). The second feature (the "1 0 3") uses only position 1 and 3 of the pipeline in that order, and disregards position 2.
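
Sketched out with the hypothetical helper from the format section, each full pipeline position now yields two features:

  # vector: 3 2 1 2 0 1 0 3 -- on "a b c d", each full 3-token window
  # (listed newest first) yields two features:
  #   window (c, b, a): 1*h(c) + 2*h(b)   and   1*h(c) + 3*h(a)
  #   window (d, c, b): 1*h(d) + 2*h(c)   and   1*h(d) + 3*h(b)
  feats = vector_tokenize("a b c d".split(), 3, [[1, 2, 0], [1, 0, 3]])
  assert len(feats) == 4  # 2 windows * 2 weight rows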

Now have a taste of this!

Now try this:

  vector: 3   2   1 2 0   1 0 2   

This means the pipeline is three tokens long; there are two feature vectorsets following. The first output feature takes token 1 in the first place and token 2 in the second, and disregards token 3. The second feature has token 1 in the first place, token 3 in the second place, and disregards token 2. Note that this means that the same word appearing in position 2 of the pipeline and in position 3 of the pipeline WILL GENERATE THE SAME FEATURE†.

†) [GerH:] that is: they will generate the same feature hash value. Take, for instance, the input token stream (bye)1 + (bye)2 + (love). Given the 3 × 2 example matrix above, these feature hashes are produced (and don't forget: the front column is the most recent token; any subsequent columns in those VT matrices point back into history!):

#1: 1*(love) + 2*(bye)2 + 0*(bye)1

#2: 1*(love) + 0*(bye)2 + 2*(bye)1

Since (bye)1 and (bye)2 hash to the same value, #2 ends up identical to #1. Hence: 'generate the same feature'.
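
This collision can be checked mechanically with the vector_tokenize sketch from the format section (hypothetical code, not CRM114 itself):

  # Stream "bye bye love" with rows [1,2,0] and [1,0,2]: the single
  # window (love, bye, bye) gives
  #   #1: 1*h(love) + 2*h(bye)   (row 1 2 0)
  #   #2: 1*h(love) + 2*h(bye)   (row 1 0 2) -- identical
  f1, f2 = vector_tokenize("bye bye love".split(), 3, [[1, 2, 0], [1, 0, 2]])
  assert f1 == f2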

Further note that those numbers in these VT matrices are multipliers employed in a universal hash scheme, so using any even numbers for those values is Very Bad Form™. Preferably, you should employ numbers which are relatively prime to 2^32, and otherwise heed the advice from Knuth (The Art of Computer Programming) and others regarding these multiplier values.
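
A small demonstration of why even multipliers are Bad Form (mod-8 stands in for mod-2^32; the effect is the same):

  # With modulus 2**k, a multiplier sharing a factor of 2 can only reach
  # multiples of that factor, shrinking the feature-hash space:
  assert {(3 * x) % 8 for x in range(8)} == {0, 1, 2, 3, 4, 5, 6, 7}
  assert {(4 * x) % 8 for x in range(8)} == {0, 4}  # 3/4 of the space lost
  # Odd multipliers are units mod 2**32, so they permute the full space.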

Hence you should take the line up there stating:

It is suggested but not required that the pipe valN values be small odd prime integers

with a large grain of salt. (In other words: I strongly disagree. Those numbers should be odd at the very least; see the Knuth reference above for further considerations in determining suitable values.)