CRM114 for Win32/UNIX - VT (Vector Tokenization) for Bayesian classifiers in crm114

New stuff being served...

Why GerH builds have a 'markovian.alt' and other '.alt' classifiers: those are R&D classifiers using the latest VT technology and classifier optimizations described here.

Ger Hobbelt

Note that this space shows the state of the art as per 7 February 2009. Any source code in this article shows the state as per that date (or earlier); its value is purely educational. Fetch a GerH build if you'd like to have this stuff in a usable form.

Table of Contents

  1. TBD

VT (Vector Tokenization) for Bayesian classifiers in crm114

crm114 Bayesian classifiers (Markovian, OSB, OSBF and Winnow) are all feature hash based. So far, no worries.

Bayesians go VT

As some technical details were already pondered in other articles (the stuff about those h1 and h2 hashes, and the Markovian specialty), you might wonder what the new VT is going to look like - and what it does under the hood. After all, those articles clearly hinted that a New And Improved VT was on its way to a release point.

You may wish to read up on the previous GerH VT engine version, as the new one is an augmented version of that.

The new VT - What You See

Folks who have dug deep into the plumbing of the SKM, SVM and Hyperspace classifiers will have noticed the existence of a separate piece of code: the VT, a.k.a. the Vector Tokenizer - the thing I refer to as the VT engine. It is a separate building block, resulting from some fine ideas from Bill and Joe Langeway, who abstracted the whole 'convert input to feature hash series' activity out of the classifiers themselves. Good Move.

The vanilla VT isn't all that visible, at least not from a script perspective. The only bit which might have made you take notice is this:

Bill:

learn|classify ... / vector: 3 1 1 1 1 /

GerH:

learn|classify ... /vector: 3 1 1 1 1 1 /

(See the VT 2007 description for Bill's)

The above represents a very basic VT matrix:

1 1 1

where the single row describes how one feature hash is constructed from the input tokens: three consecutive input token hashes (a) (b) (c) are combined into one feature hash value using the universal hash paradigm (see Knuth, The Art of Computer Programming) as {1 × (a) + 1 × (b) + 1 × (c)}. In hopefully more understandable words: the single feature hash generated per input token ('word') mixes the current token with the two immediately preceding tokens (words) in a position-independent manner (each token hash has the same multiplier).

Thus (beam)+(me)+(up) will have the same feature hash value as (up)+(me)+(beam), meaning both word sequences are completely indiscernible from each other from the point of view of the classifier being applied to them.

Hence, the default (built-in) VT matrices look more like this:

1 3 5

i.e. three consecutive input tokens (words) are mixed in a position-dependent manner: here (beam)+(me)+(up) will definitely produce a hash quite different from the one for (up)+(me)+(beam), thus allowing the classifiers to tell the two word sequences apart.
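
To make the difference concrete, here is a minimal C sketch of that mixing step. This is not the actual crm114 source: the token hash values, the crmhash_t typedef and the mix() helper are made up purely for illustration, but the arithmetic is the universal-hash sum described above.

/* minimal sketch: how one VT matrix row mixes a window of token hashes */
/* (illustrative only - not the crm114 implementation)                  */
#include <stdio.h>

typedef unsigned int crmhash_t;          /* assumption: 32-bit hashes */

/* sum of weight[i] * tok[i], wrapping mod 2^32 */
static crmhash_t mix(const crmhash_t *tok, const int *weight, int len)
{
    crmhash_t h = 0;
    for (int i = 0; i < len; i++)
        h += (crmhash_t)weight[i] * tok[i];
    return h;
}

int main(void)
{
    /* made-up token hashes for (beam) (me) (up) and the reverse order */
    crmhash_t beam_me_up[3] = { 0x1111, 0x2222, 0x3333 };
    crmhash_t up_me_beam[3] = { 0x3333, 0x2222, 0x1111 };

    int flat[3]     = { 1, 1, 1 };       /* position-independent row */
    int weighted[3] = { 1, 3, 5 };       /* position-dependent row   */

    printf("flat:     %u == %u\n", mix(beam_me_up, flat, 3),
                                   mix(up_me_beam, flat, 3));      /* identical */
    printf("weighted: %u != %u\n", mix(beam_me_up, weighted, 3),
                                   mix(up_me_beam, weighted, 3));  /* distinct  */
    return 0;
}

Run it and the 'flat' line prints two identical hashes, while the 'weighted' line prints two different ones: exactly the distinction the position-dependent default row is there for.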

Why does GerH have an extra value in the vector: spec?

Because GerH's VT code already anticipated some issues which would come up when VT was, as planned, introduced to the Bayesian classifiers in the crm114 collective (Markovian, OSB, OSBF, Winnow): those employ a 'cross-cut' dual hash scheme, requiring two different VT matrices for each of them. Which led to the rapid conclusion that we didn't need a 2D VT matrix, but rather a 3D VT cube. Which is just the hype-worded equivalent of 'a set of 2D matrices'.

BillY's code also included provisions for this future development, yet omitted it from the vector: interface definition, thus making it impossible to define custom 3D cubes - even though the code does have built-in 3D cubes.

Thus vanilla/BillY vector: spec is

/ vector: col# row# matrix-data... /

where GerH's spec is:

/ vector: col# row# depth# matrix-data... /

That explains the extra '1' in the GerH example up there.

Both specify the matrix-data the same way: row after row of columns of numbers. GerH extends this to support depths larger than one: once the first 2D matrix (of size cols × rows) has been listed, continue with the data for the second matrix (assumed to have the same size), and so on.
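
For illustration only (the second matrix' values here are made up by me, they are not the built-in defaults), a depth-2 spec in the GerH notation would therefore read:

learn|classify ... /vector: 3 1 2 1 3 5 1 7 11 /

i.e. 3 columns, 1 row, depth 2, followed by '1 3 5' as matrix #1 and '1 7 11' as matrix #2.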

So far the current state of affairs, before the GerH BlameBarack release...

What've you got?

The new, augmented GerH VT argument spec takes all the concerns we've discussed before into account:

/ vector: col# row# depth# matrix-data...
  weight: col# row# feature-weight-data...
  vector-clip: col# row# depth#
  weight-clip: col# row#
  weight-model: (flat | osb | chi2 | entropic | markov | super_markov | bcs | bcs_mws | bcs_exp | bcs_b7 | osbf)
  vector-model: (osb | sbph)
/

*) 'bcs' stands for Breyer/Chhabra/Siefkes(/Yerazunis) [Wikipedia] [CiteSeer].
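
Purely as a sketch of how these clauses might be combined in a script (the matrix numbers are the illustrative row from before, and I'm assuming clauses you leave out fall back to their built-in defaults; see the BlameBarack release notes for the authoritative syntax):

learn|classify ... / vector: 3 1 1 1 3 5
                     weight-model: osb
                     vector-model: osb /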