CRM114 for Win32/UNIX - Migrating vanilla ('long' int) FSCM code to portable int/uint32_t/... code.

Here is the write-up (rudimentary) of the migration of the latest published vanilla FSCM source code file to portable use of int, uint32_t, etc. basic types.

New stuff being served...

This is mostly a copy of the message posted to the CRM114 mailing list, but this one comes with the files included for direct viewing and downloading, plus a few additions of its own, as time permits. New stuff will be tagged with the little served image at left.

Have fun with this work in progress!

Ger Hobbelt

Note that this space shows the state of the art as per 19 june 2008. The source code files are NOT updated after that and stand on their own; their value is purely educational. Fetch a GerH build if you like to have this stuff in a usable form.

Table of Contents

  1. The files
    1. Compressed download:
    2. Direct view:
  2. So what was the approach to get to this result?
    1. First phase: simply find & replace all 'long' with 'int'
    2. Second phase: hashes fixing
    3. Third phase: check code for system calls without error checks
    4. Fourth phase: try a compile
    5. Fifth phase: fix up all the printf() lines out there
  3. In closing...

Intro

Okay, like promised: the FSCM code migrated to 'int' + [u]int32_t and all with (nearly) no warnings; see previous message of mine today; other subject title with that one.

NOTE: this is not a finished job! There's still a few little ha's and hm's in the code that I want to look at, but that's got NOTHING to do with the int/uint32_t conversion we are talking about here. Besides, time is a factor.

The files

Compressed download:

New stuff being served...

Download the 7zip archive with all the necessary goodies. The same, but now as .tar.gz archive for default UNIX installs. Note that these archives lack one file: the original WGET code from Bill. It is available separately right here in compressed tar.gz format.

Direct view:

The result: crm_fast_substring_compression.c

... is the new FSCM int/uint32_t code in what I call 'backported' state: it compiles when thrown into a directory with the other code straight from WGET ('vanilla CRM114'), compiles and runs. At least on my boxes. :-)

Since I've by now spent about half a year of work on this and some things started to irritate enough halfway down that path, that's why you get ...

uncrustify.cfg

... which is a config file for the open source code cleanup tool called 'uncrustify'. See google/sourceforge. I use it because then I can run all code through it from time to time so it stays readable for me, even after some extremely savage sessions. The very few bits of odd formatting are quickly coped with. Besides, my diff tools can cope with the changes that result from this, so it does impact my migration/checks with WGET copies only very little, so there's really no downside to this.

Then we have ...

crm_fast_substring_compression.c.GerH.rough-migrate

which is a copy of the int/uint32_t migrated source code, right after the migration was done. No cleanup, nothing. Just edits to move from 'long' et al to something that's at least adhering to my minimum quality requirements. Or close enough. When you compare, you will see that those requirements include full error checking for I/O at least and other code/calls whenever possible.

TIP: Make writing stuff like

if (n != fwrite(buf, size_of, n, of))
{
  barf("yo! out of disc space or something! Go nuts!\n");
}

a habit instead of just

fwrite(buf, size_of, n, of);

and you're halfway there already.

crm_fast_substring_compression.c.GerH.rough-migrate

still has not fixed fringe issues in the CSS file extraction from the script line, but I'll do that later. Does not apply when you spec two CSS files as usual, so don't bother for now. It's just that the MAX_CLASSIFIERS check is odd and could be done in a (IMO) easier reviewable/checkable way with no risk.

On to the last in the set:

crm_fast_substring_compression.c.GerH.uncrustified

... which is the final result for today: rough-migrate pulled through uncrustify. How do you like it? Better readable, eh?

So what was the approach to get to this result?

First off, quick code inspection.

Of course, I can grin and say 'global find & replace (long) --> (int)' but that would land you exactly where you don't want to be, because you may lose info in the process: int stays int and maybe some of those 'long's were meant to be something else. Better see some of that before you pull the trigger on El Lupo.

//  crm_fast_substring_compression.c  - CRM114 - version v1.0
...
int crm_fast_substring_learn (CSL_CELL *csl, ARGPARSE_BLOCK *apb, 
			      char *txtptr, long txtstart, long txtlen)
{
  ...
  //        unk_hashes is tempbuf, but casting-aliased to FSCM chains
  int unk_hashcount;
  crmhash_t *unk_hashes;
  unk_hashes = (crmhash_t *) tempbuf;
...
	//   Now it's time to generate the actual string hashes.
	//   By default (no regex) it's a string kernel, length 3,
	//   but it can be any prefix one desires.
	//
	//   Generate the hashes.
	crm_vector_tokenize_selector
	  (apb,                   // the APB 
	   txtptr,                 // intput string
	   txtlen,                 // how many bytes
	   txtstart,               // starting offset
	   NULL,                  // custom tokenizer
	   NULL,                   // custom matrix
	   unk_hashes, // where to put the hashed results
	   data_window_size / sizeof (unk_hashes[0]), //  max number of hashes
	   &unk_hashcount);             // how many hashes we actually got    
...

Inspection identifies in learn and classify code that 'unk_hashes' is a prime candidate for 32-bit strict width as it stores hashes. (And hashes are very sensitive to (different) storage bit-sizes.

typedef struct {
  uint32_t index;
} FSCM_PREFIX_TABLE_CELL;

typedef struct {
  uint32_t next;
} FSCM_HASH_CHAIN_CELL;

Then there are a few tiny structures at the top, which sound like they _might_ also appear on disc ('persistent storage'). A bit of search reveals this assumption to be true. Okay, so we know we got to watch our six now, but we got a job to do.

First phase: simply find & replace all 'long' with 'int'

Simple.

Second phase: hashes fixing

 

As you'll see in any GerH distro, I defined a typeof custom type for the hashes. Just to make sure I can track where 'hash' is still hash and where it is NOT. Hence we have:

typedef uint32_t crmhash_t;

and crmhash_t is what we gonna use. Do not, repeat not, conclude this shows you need bloody custom typedef's for bloody everything. I have had special heavy Jack Daniels medication for when I was working with code projects that were based on that beautiful 'conclusion'.

Use custom types with MUCH restraint. CRM114-GerH has two of them:

 crmhash_t

and

 hitcount_t

which is a typedef replacing 'long long / int64_t' for feature hit counters in production classifiers. Bill, that means hitcount_t is my answer to that 'long long' item we were talking about. Compiler aint got 'long long'? Fine, then it's got to have int64_t or some such as a replacement; if not, we can always yammer and revert to 'long' with a bit of whining to the user. CRM114-GerH does not (yet?) support machines which do not have 'long long' so when I apparently don't worry too much about it, why should you?

Back to our 'phase 2': hash fixing.

Track down all the DECLARATIONS of 'unk_hashes' and replace that 'unsigned int' there with a nice 'crmhash_t'.

For added bonus points, grep for unk_hashes usage: where it is passed as-is (or an element from this array is passed as-is) to a function, check function prototype and make sure prototype has crmhash_t as argument type for that parameter.

This means a '%' modulo operation does not 'propagate' the 'uint32_t'-ness of the crmhash_t: because the intent of that '%' is to convert the hash into an array index in a hash table.

Since we are basically 32-bit memory system bound (for the lowest range of platforms), I argue that array indexes are permitted to be 'int' without any harm; 'unsigned' may go away - too much typing, too much to read too, when scanning source code. 'unsigned' is a very strong signal to my brain, so I tend to use it only when it must be there. 'int' has one(1) power of two less range in the positive than the maximum possible for any system, 32, 64 or whatever, so using such an int to index arrays with elements of size 2 or more means the 'int' is quite sufficient to address the complete theoretically available address space. Hence: no need for unsigned.

I was rather in a hurry today, so I did not clean up the 'unsigned's in there where I think they can go, but be assured there's a lot of them can can be discarded without ill results.

Third phase: check code for system calls without error checks

That's fopen/fwrite/the bloody lot. Copy&paste + edit error check code from other files for quick fix of this: any fwrite() may more or less silently fail to write all elements: hence the mandatory check for the number of items it reports back as actually written. Search the original FSCM WGET code file and the new one for these edits. After one or two, you'll definitely see the pattern emerge:

//    Write out the initial file header..
hashf = fopen(hashfilename, "wb+");
if (!hashf)
{
    fev = nonfatalerror_ex(SRC_LOC(),
            "\n Couldn't open your new CSS file %s for writing; errno=%d(%s)\n",
            hashfilename,
            errno,
            errno_descr(errno));
    return fev;
}
else
{
    if (1 != fwrite(&my_header,
                sizeof(STATISTICS_FILE_HEADER_STRUCT),
                1,
                hashf))
    {
        fev = nonfatalerror_ex(SRC_LOC(),
                "\n Couldn't write header to file %s; errno=%d(%s)\n",
                hashfilename, errno, errno_descr(errno));
        fclose(hashf);
        return fev;
    }
    fclose(hashf);
}

Fourth phase: try a compile

Don't sweat over warnings, just the errors. Make it compile: inspect code around error lines and fix the code. Applicable here as part of my migration process because I know some functions in crm114 have changed interface, for example the VT (Vector Tokenization) calls: GerH allows for custom tokenizers (not necessarily regex-based) and such, so the function has a few parameters changed.

Fix that, but do inspect the code before you go and kill the old code. My luck Bill had all 'unused' args in there so migration was very fast.

Fifth phase: fix up all the printf() lines out there

This requires a bit of brain sometimes, hence a separate phase.

The idea here is simple: we only use C89-style printf() to ensure the widest possible compatibility/compilability of our code. That means we need to think up something for our uint32_t's here.

I gave the answer already in last night's email, so here's the short version of it:

Typecast every [u]int[8|16|32|64]_t to a C89 type which you know will be as large or larger. Do not care about unsigned/signed of the casted result, because signed/unsigned only matters for the int[xx]_t type where you came from: if that one has the wrong sign treatment, you're screwed anyway, and much more then than only at a few measly printf() lines: you calculations will suck mud.

Here's a little lookup table for those typecasts: X marks the spot where typecast is Good; . is Ba-a-a-a-d.

to: %c
%02x
char
%d/%u
 %04x/%08x
int
%ld/%lu
%08lx
long
%lld/%llu
%016llx
long long
from:        
[u]int8_t X X X X
[u]int16_t . X X X
[u]int32_t . X X X
[u]int64_t . . . X

The table is valid for any 32-bit and higher (64-bit, ...) system out there (note that C89 'short' is not listed: I only list types important for printf() usage here). The ever wicked item there is the 'long long' (and of course the 'int64_t': if your compiler cannot handle 'long long', you're in for much rejoice anyhow, and as I said many times: 'long long' availability is a prerequisite for CRM114. When you have CSS files with a lot of stuff in them, hitcounts, etc. may overflow 31-bit (signed 32-bit) counters, especially when you start to weight them or otherwise and don't do so in floating point. Anyway, there's very few non-longlong supporting boxes out there these days, at least not in the computer space where CRM114 may wish to exist. So screw that, 'long long' is fine and here we go...

Given the truth table above, it's very easy to see what needs to be cast to what.

NOTE: For 'safety's sake I would like to stress that 'int32_t' should be cast to 'long' instead of the equally valid 'int', but that's just some letfover epileptic fit I get when having to port 32-bit code into the 16-bit realm. Happens very seldom these days, but it has happened to me in sufficient amounts that I still tend to prevent disaster for when it comes. 'long' on a 16-bit box is always 32-bit, even on all 8-bit MCU's I've met, hence my tendency to say 'int32_t goes into long'.

Anyway, since we already did shoot our sawed off shotgun (El Lupo) and axed every long out there into an int, phase 4 (try a compile) will already have given you a shitload of hints where you might want to have a peek at printf()'s and all, especially when you have such a nice cooperative compiler like GCC who does additional vararg checking for you (if you help it along a bit - see __attribute__() for the interested).

Bill's code has only very few spots where hashes are printed and where he does, I generally tend to 'rewrite' it these days when I see another line where 8 or more hashes are dumped.

Here's the code:

if (internal_trace)
{
    int display_count = CRM_MIN(16, unk_hashcount);
    fprintf(stderr, "Total %d hashes - first %d values:", unk_hashcount, display_count);
    for (i = 0; i < display_count; i++)
    {
    	if (i % 8 == 0)
    	{
    		fprintf(stderr, "\n#%d..%d:", i, CRM_MIN(i + 7, display_count - 1));
    	}
    	fprintf(stderr, " %08lx", (unsigned long int)unk_hashes[i]);
    }
    fprintf(stderr, "\n");
}
    

Notice that I do not like to print hashes as decimal integers (%lu or %ld: %lu is better in this case, BTW) but as 8-byte, zero-padded hex numbers: %08lx. Because it's 32-bits I'm looking at, so I want to see them bits. And I see them much better in hex mode. There's probably been a time when I dreamt in nibbles, but I digress.

The loop can be copy&pasted (and of course bludgeoned to taste) anywhere. CRM_MIN() is one of my 'standard' macros which just returns the minimum value of two; really modern folks with a bit too much of C++ in their grey matter will suggest to write it as an inline function - which has no side effects ever -, while I am quite comfortable with the macros, with or without their side effects. Hence the little ugliness in the 'hack section':

#define CRM_MIN(a, b)    ((a) < (b) ? (a) : (b))

All uppercase for macros is enough warning for me.

Notice the single uint32_t (or should I say: crmhash_t) being typecast in that printf() there: less annoying code looks, easier adaptable thanks to loop counter, and one cast only: unsigned long. print as %08lx (or %08lX) and you get nicely printed, 8-nibble hex value for your hash. On any system.

In closing...

That concludes the process for now; remind yourself that

'always use regular 'int', use 'long' to signal readers you expect quite large values to end up in there which may overflow an int on the day following Armageddon, but do not assume that your (few!) longs will carry more bits than any of your ints.'

And then, only when you really, really need an item to be restricted so much, it may not even grow a single bit bigger, ever, use the new C99 types [u]intXX_t; given the special circumstances, one might sometimes also consider to wrap this type in a typedef'd custom type which has a more 'readable' name, such as crmhash_t (carries hashes around unharmed and unedited) or hitcount_t (keeps track of really huge hitcounts/sums).

In closing, note that the job is not completely done yet, but I'd say the major conversion to int/int32_t is done 90%. The last few bits that still need to be done are code inspection-based and eat time like nothing else: these include overall functional inspection to see if the type has been properly chosen from a functional point of view after all. It's more 'code review'-ish because one could have done the same to the 'long' code as it existed before. It's just that I find 'int'-based code so much easier to read for this (and other) purposes.