CRM114 for Win32/UNIX - crm114 Bayesian classifiers and the h1, h2 feature hash values

New stuff being served...

Why GerH builds have a 'markovian.alt' and other '.alt' classifiers: those are R&D classifiers using the latest VT technology and classifier optimizations described here.

Ger Hobbelt

Note that this space shows the state of the art as per 7 february 2009. Any source code in this article shows the state as per that date (or earlier); their value is purely educational. Fetch a GerH build if you like to have this stuff in a usable form.

Table of Contents

  1. TBD

crm114 Bayesian classifiers and the h1, h2 feature hash values

crm114 Bayesian classifiers (Markovian, OSB, OSBF and Winnow) are all feature hash based. So far, no worries.

But when we look a little at the implementation, you'll note that those feature hashes actually are TWO(2) hashes. h1 and h2. How are those produced from the 'tokens' ('words') we mentioned in another article?

The code takes the tokens and hashes each of them to a 32-bit hash value. So far, so simple.

But the h1/h2 things happens at the spot where those individual token hashes are combined into a final feature hash: recall from the other article that multiple tokens are combined into a Markov token chain, which will produce -- or so we said -- one feature hash. And we didn't lie. h1/h2 may be two different hash values, but together, they form the feature hash as seen by the classifiers.

A bit of tech first

The individual token hash values are mixed into feature hash values using a universal hashing scheme (see Knuth and others), which basically takes a token hash value, multiplies it with a odd (and preferable 'relatively prime' to 2^32) number and adds this result to the feature hash value (which started by being initialized to zero). So far nothing special: a universal hash scheme is used to mix the token hashes in to a feature hash, which can be used to locate 'hits' in the CSS databases.

Yet all Bayesian classifiers employ two such hash values!

Each of these two hash values is calculated using the same mechanism (the universal hash), use the same selected token hash values, yet have applied different universal hash multipliers!

Paraphrasing the crm114 pseudocode
for (i = 0; i < lengthof(hashpipe); i++)
  32bithash h1 = 1 * hashpipe[i] + 5 * hashpipe[i-1];
  32bithash h2 = 3 * hashpipe[i] + 7 * hashpipe[i-1];

This results in two different hash values, where the second one is referred to as a 'cross cut'. A 'hit' is scored when both hashes match the hashes stored inside the CSS record found in the linear probe hashtable, using a h1 derived index value as a probe.

Paraphrasing the crm114 pseudocode
idx = h1 % lengthof(css_hashtqable);
if (css_hashtable[idx].hash == h1
    && css_hashtable[idx].key == h2)

When I saw this before, I didn't think too much about it and just assumed this was an attempt to create something akin to a 64-bit feature hash value in a time when 64-bit integers were not available through the usual compilers on the usual platforms. (They are today, however.)

A closer look though...

But the VT move led to another, closer look at these h1,h2 numbers.

Now here's the thought: since h1 and h2 are both derived off the same set of inputs (token hashes) and only differ in the universal hash multipliers applied to them, these two hashes would only 'expand' the bit space of the combine beyond that of a single hash calculated this way when, and this is important, bit avalanche of the individual hashes would be considered suboptimal! (Which is indeed an, at least theoretic, deficit in the universal hash scheme: the multiplication will prevent higher bits in the inputs from propagating into the lower bits of the result, thus offering poor avalanche behaviour.

However, I think creating two universal hash values like this does not solve this issue -- iff it is an issue for this type of input, which is a relatively rare situation. Alas, hard numbers are missing to 'prove' this.

The point?

The second calculated hash value h2 is superfluous: it does not present additional 'entropy' derived from the input - as the exact same input is used for both - so, iff the hashing system is disputable another, proper, solution should be found (and, personally I don't think it is not, at least not this hash system. GerH builds offer different token hash algorithms as alternatives for the original, because I believe that hash (which is entirely different code) is far more important to attaining good avalanche and overall hash distribution then this universal hash post-op which is only used to combine a (small!) set of token hashes into a feature hash. The biggest influence to hash distribution is situated further up front in the whole process: at the token hash function (strnhash()).

Thus follows that the code should be able to perform as well with h1 only, discarding all effort related to h2.

And when this is shown to not be true, it may mean either the analysis above is severely flawed, or the universal hash scheme employed at the token hash -> feature hash conversion level is inadequate. In case of the latter, a reasonable solution is to employ a different hash scheme at this level, though this will have significant impact on the VT mix matrices, which implicitly assume the use of the universal hash algorithm at this upper level.

Another inroad to an 'improved feature hash calculation', if such is required, is to upgrade the hash stages from 32-bit to 64-bit hashes. As then the token hash algorithms will produce 64 bits hashes, the added information inherently present in these can still be mixed using the universal hash to produce a single (now also 64 bit) feature hash for h1. Still we can do without h2 then, and the 64-bit hashes would occupy the same storage space in the CSS databases as the current dual h1,h2 32-bit hash scheme, though at an (expected) improved accuracy.

That last bit applies only for very large input sets, i.e. where hash value collisions within a 32-bit space are inevitable. It does not apply for very hard to hash raw input material, i.e. raw input material where the 32 bit token hash creates a lot of collisions. There, selecting an alternative token hash (strnhash()) function is far more beneficial as that's the first place to look when you wish to have a better hash distribution for your particular input.


Currently, there is no evidence a 64-bit / alternative top-level hash system move is needed, so I might safely assume I can simply discard the h2 32-bit hash and remain with a little simpler, yet equally performing code for the Bayesian classifiers.