CRM114 for Win32/UNIX - alt.* Bayesian/Markovian classifiers in crm114: future plans

New stuff being served...

Why GerH builds have an 'alt.markovian' and other 'alt.*' classifiers: those are R&D classifiers using the latest VT technology and the classifier optimizations described here.

Ger Hobbelt

Note that this space shows the state of the art as of 22 February 2009. Any source code in this article shows the state as of that date (or earlier); its value is purely educational. Fetch a GerH build if you would like to have this stuff in a usable form.

Table of Contents

  1. TBD

Anti-aging for Bayesian classifiers in crm114

crm114 Bayesian classifiers (Markovian, OSB, OSBF and Winnow) are all feature hash based. So far, no worries. All of these classifiers support the 'microgroom' option, which is meant to prune the CSS feature database (a feature hash table) by discarding old entries.

Speed is what I need

The vanilla microgroom implementation is built on top of the hash table algorithm used for the markovian, osb, osbf and winnow classifiers: a linear probe hash table.

The fact that a linear probe hash table is used implies that new features are appended to the end of the probe chain, which starts at the calculated hash index. That means the oldest entries are located at the front of the chain.
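To make that concrete, here is a minimal linear-probe insertion sketch. The SLOT layout, the lp_insert() name and the 'hash value 0 means empty slot' convention are illustrative simplifications only; the real crm114 bucket struct carries more fields.

    /* Toy slot layout for the sketches in this article. */
    typedef struct
    {
        unsigned int hash;    /* feature hash; 0 == empty slot here */
        unsigned int value;   /* feature weight / hit count */
    } SLOT;

    /* Walk the chain from the calculated index; bump an existing entry,
       or append the feature at the first free slot past the chain end.
       Because we always append, the oldest entries sit at the chain front.
       (Assumes the table is never completely full.) */
    static int lp_insert(SLOT *table, size_t size, unsigned int h)
    {
        size_t i = h % size;

        while (table[i].hash != 0)
        {
            if (table[i].hash == h)
            {
                table[i].value++;      /* feature seen before: bump it */
                return 0;
            }
            i = (i + 1) % size;        /* linear probe: step to next slot */
        }
        table[i].hash = h;             /* append at the end of the chain */
        table[i].value = 1;
        return 1;
    }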

This is exactly what happens, and it has two effects. First: when a classification run is executed ('classify'), the features trained more recently sit at the end of the linear probe chains (which can become quite long once a CSS database fills with training data over time), so they are 'hit' (found) only after the larger part of the chain has been traversed. This of course assumes new email matches recent features more often than old ones.

The second effect of the linear probe / append methodology is that microgroom, which is tasked with discarding old entries, has to remove the front part of the chain. That is a costly operation: linear probe chains intermix, so discarding entries at the front of a chain breaks every lookup that used to probe across the removed slots, unless the remainder of the chain is regenerated by re-inserting its features into the hash table, thus recreating a valid linear probe chain. Note that chains can and do get merged together, so discarding entries in a rather full linear probe hash table takes a bit of work to repair, as multiple, partial chains have to be fixed.
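A sketch of the repair work such a removal forces, assuming the toy SLOT table from the previous fragment: after punching a hole, every entry in the rest of the contiguous cluster must be pulled out and re-inserted, or lookups that probed across the hole would fail. (The real microgroom also weighs entry age when picking victims; the cost structure is what matters here.)

    /* Re-insert a pulled-out entry, preserving its value. */
    static void lp_reinsert(SLOT *table, size_t size, SLOT s)
    {
        size_t i = s.hash % size;
        while (table[i].hash != 0)
            i = (i + 1) % size;          /* find the new chain end */
        table[i] = s;
    }

    /* Delete one slot, then rehash the rest of the cluster to keep
       every remaining probe chain intact. */
    static void lp_delete(SLOT *table, size_t size, size_t victim)
    {
        SLOT empty = { 0, 0 };
        size_t i = victim;

        table[i] = empty;                /* punch a hole in the chain */
        i = (i + 1) % size;
        while (table[i].hash != 0)       /* rehash the rest of the cluster */
        {
            SLOT moved = table[i];
            table[i] = empty;            /* pull it out ... */
            lp_reinsert(table, size, moved);  /* ... and re-insert it */
            i = (i + 1) % size;
        }
    }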

Of course, a speed improvement here would be to take the last entries in the chain and move them into the freed front slots, but then the whole idea of 'later in chain equals more recent' is blown to smithereens, so this is not advised, given microgroom's assumption that 'front is old'.

Hence one might look outside the realm of linear probing for solutions which might speed up both the classification process (assuming more recent features are sought more often) and the microgroom process (if we can keep the oldest entries at the tail of a chain of any sort, we might just get away with discarding that tail without any repair work).

Enter cuckoo hash tables

These have a few characteristics we may use to our benefit. First, cuckoo hash tables store recent entries at the front of the probe chain (by 'pushing away' entries which previously occupied that slot), which means more recent features will be found faster in the classify process. Second, the fact that, now, 'oldest is last' may mean we can simply discard the oldest entries during a microgroom, with no restructuring work needed. Since cuckoo hash tables use a quite different chaining scheme, the risk of mixing chains a la linear probing is nil, which leads me to expect zero (or little) restoration work when microgrooming a cuckoo hash table based feature database. (With linear probe hash tables, once one chain's tail touches the front of another chain, both become one and are indiscernible from each other from the head of the second chain onwards. With cuckoo hash tables, chains can intertwine, but only for one (or a few) entries, as different step rates are used.)
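A minimal 2-hash, 1-slot-per-bucket cuckoo insertion sketch, reusing the toy SLOT layout from above. The h1()/h2() helpers are placeholder hash derivations, not crm114's actual hash functions. Note how the freshly trained feature always claims its primary slot, shoving older occupants further down the displacement chain:

    #define MAX_KICKS 32

    static size_t h1(unsigned int h, size_t size) { return h % size; }
    static size_t h2(unsigned int h, size_t size) { return (h * 0x9E3779B1u) % size; }

    /* Map an entry from one of its two home slots to the other. */
    static size_t alternate(SLOT s, size_t i, size_t size)
    {
        size_t a = h1(s.hash, size);
        return (i == a) ? h2(s.hash, size) : a;
    }

    static int cuckoo_insert(SLOT *table, size_t size, SLOT s)
    {
        size_t i = h1(s.hash, size);
        int kicks;

        for (kicks = 0; kicks < MAX_KICKS; kicks++)
        {
            SLOT evicted = table[i];
            table[i] = s;                  /* newest entry takes the slot */
            if (evicted.hash == 0)
                return 1;                  /* pushed into a free slot: done */
            s = evicted;                   /* push the old occupant onward */
            i = alternate(s, i, size);     /* ... to its other home slot */
        }
        return 0;   /* displacement chain too long: time to groom/rehash */
    }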

Classify performance estimate

I expect classify to perform as well with cuckoo hashing as it does with linear probe hashing, or better. Linear probe hashing has the advantage that, when the fill rate of the database is very low, chain lengths stay below 2: one hash entry filled, the next one empty, gives length = 1. (With two consecutive hash entries filled, you already have a situation indiscernible from a linear probe chain of length = 2.)

In other words: linear probe hashing requires only a single hash index in optimum conditions.

The drawback of linear probe hash table storage is that its performance deteriorates dramatically as the fill rate rises. The common rule of thumb is to never go beyond a fill rate of 50%, but be reminded that this holds for flat hash distributions, which are not guaranteed to occur for every input / use.
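For reference, the classic (Knuth) expected probe counts for linear probing under a uniform hash assumption, with a = fill rate, show how steep that curve is:

    E[probes, successful search]   ~= 0.5 * (1 + 1/(1 - a))
    E[probes, unsuccessful search] ~= 0.5 * (1 + 1/(1 - a)^2)

At a = 0.5 that works out to roughly 1.5 and 2.5 probes; at a = 0.9 it is already about 5.5 and 50.5.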

On the other hand, an n-hash cuckoo hash table requires n hash indexes under any conditions. For basic cuckoo hashing, you'll need 2 hashes, which puts it at a disadvantage compared to a very lightly filled linear probe hash table.
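That fixed cost looks like this in a sketch (reusing the toy h1()/h2() helpers from the insertion fragment): always two hash computations and at most two slot probes, never more.

    /* Basic cuckoo lookup: an entry, if present, is always in one of
       its two home slots, regardless of fill rate. */
    static SLOT *cuckoo_find(SLOT *table, size_t size, unsigned int h)
    {
        size_t a = h1(h, size);
        size_t b = h2(h, size);

        if (table[a].hash == h) return &table[a];
        if (table[b].hash == h) return &table[b];
        return NULL;    /* feature not in the database */
    }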

Nevertheless, cuckoo hash tables are far more resilient to table fill rate: their performance only decreases notably at a much higher fill rate; depending on the type of cuckoo hash, that point lies somewhere between 80 and 96% fill. Quite a difference from a linear probe hash table, which, by then, is nearing sequential scan cost (O(N) instead of O(1)).

Surely cuckoo hashing is not the golden egg of hashing technology, so it too suffers from chain length performance degradation, but this can be alleviated by using n hashes and/or m-slot buckets: the original cuckoo hash had a bucket size of 1, but no-one is restricted to that; for example, 2-hash, 2-slot-bucket cuckoo hash systems are very efficient, even when the fill rate is rather high.
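A hypothetical lookup for such a 2-hash, 2-slot-bucket variant: each hash now selects a bucket of SLOTS_PER_BUCKET contiguous slots, so a miss costs at most 2 * SLOTS_PER_BUCKET probes, while the table tolerates a much higher fill rate. The names and layout (table holds nbuckets * SLOTS_PER_BUCKET slots) are illustrative assumptions.

    #define SLOTS_PER_BUCKET 2

    static SLOT *cuckoo2x2_find(SLOT *table, size_t nbuckets, unsigned int h)
    {
        size_t b[2] = { h1(h, nbuckets), h2(h, nbuckets) };
        int i, j;

        for (i = 0; i < 2; i++)             /* both home buckets ... */
            for (j = 0; j < SLOTS_PER_BUCKET; j++)  /* ... all slots each */
            {
                SLOT *s = &table[b[i] * SLOTS_PER_BUCKET + j];
                if (s->hash == h)
                    return s;
            }
        return NULL;
    }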

Training performance estimate

As training is mostly adding features to the hash table (which includes one chain scan per feature hash plus an addition to that chain), I expect performance to be comparable: for linear probe hashing, it's one full chain traversal plus one append, while with cuckoo hashing it's one insert at the front and a 'push back' along the entire chain.

The major difference here is that linear probe hash systems get their chains merged over time, resulting in a new chain whose length is the total of both previous chains. Cuckoo hashing, thanks to its hash-based step rate, will also 'touch' other chains, but the new chain length will not be the sum of both; it will just increase by one (1) entry when chains collide. Ergo: we can expect shorter probe chains with cuckoo hashing.

The second part of training is the 'microgroom' operation. Since this operation is expected to remove the oldest entries from the probe chain, such an activity would be quite fast for cuckoo hashing (clip the tail; some minimal restoration is required where hash chains have collided), while it takes quite a bit of effort in the linear probe hashing scenario: the front entries are removed and the entire chain has to be restructured, re-entering entries into the hash table to create new, valid probe chains.
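One way to make 'clip the oldest' concrete: since every cuckoo entry lives in one of its own home slots, independent of its neighbours, a stale entry can simply be cleared in place; in the basic 2-hash scheme that is all there is to it. The sketch below assumes a hypothetical per-slot 'gen' stamp recording the training generation that last touched the feature; no such field exists in the vanilla CSS layout.

    typedef struct
    {
        unsigned int hash;    /* feature hash; 0 == empty slot */
        unsigned int value;   /* feature weight / hit count */
        unsigned int gen;     /* hypothetical: generation of last touch */
    } GSLOT;

    /* Clear every feature older than 'cutoff' in place; no chain repair
       is needed, as cuckoo lookups always check both home slots anyway. */
    static size_t cuckoo_groom(GSLOT *table, size_t size, unsigned int cutoff)
    {
        size_t i, zapped = 0;

        for (i = 0; i < size; i++)
            if (table[i].hash != 0 && table[i].gen < cutoff)
            {
                table[i].hash = 0;     /* discard the stale feature */
                table[i].value = 0;
                zapped++;
            }
        return zapped;
    }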

All in all, I don't expect much difference; maybe a slight win, but not much.

After all, the major benefits are expected to show for classification runs in well filled CSS databases.

Disc performance / system cache considerations

As crm114 uses memory-mapped hash tables as its storage unit, cuckoo hash tables can be used here quite nicely. The dual hash index cost can be reduced by ensuring both indexes end up in the same memory map/disc page, thus resulting in system cache hit/miss performance almost identical to linear probe hashing.
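A sketch of one way to pin both home slots onto the same page: use the first hash to pick a page-sized group of slots, then derive both slot offsets within that group. The page size, helper names and the second mixing constant are illustrative assumptions only; a real implementation must also ensure the two offsets differ.

    #define SLOTS_PER_PAGE (4096 / sizeof(SLOT))   /* assumed 4 KiB pages */

    /* Return the table index of home slot 'which' (0 or 1) for feature
       hash h; both slots land inside the same page-sized slot group. */
    static size_t same_page_slot(unsigned int h, int which, size_t npages)
    {
        size_t page = h1(h, npages);                      /* picks the page */
        size_t off  = (which == 0)
                    ? h % SLOTS_PER_PAGE                  /* first home slot */
                    : (h * 0x85EBCA6Bu) % SLOTS_PER_PAGE; /* second home slot */
        return page * SLOTS_PER_PAGE + off;
    }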

Increasing the number of slots per cuckoo bucket may also improve disc cache hit/miss performance, though at the cost of an increased number of probes, especially when those buckets are scanned as linear chains instead of treated as 'mini' cuckoo hash tables (one can use additional hash functions to derive indexes within those (larger) buckets to improve in-memory hit/miss performance over a linear scan within the bucket).

Speed is what I need

The above describes the idea; the alt.markovian / osb / osbf / winnow classifier code will be used to test these hypotheses.

Further notes

Another alternative considered is HAMT (Hash Array Mapped Trie), which allows for growing hash stores. But this is a far-off idea, as the benefits are not all that clear when compared to cuckoo hashing.