
The 20080327-HS-NN-patch-1.diff explained

Each section of the diff is described briefly below, so you'll understand what's going on and why those changes were made.

Generally, patch parts will be discussed in order of appearance; however, sometimes forward references are provided to other parts of the patch file which address the same issue.

Enjoy.

Ger Hobbelt


Section 1: Let's go! [crm114_config.h]

Here's the start of the patch:

Common subdirectories: src/crm114.sourceforge.net/src/BSD_build_files and src/patched/crm114.sourceforge.net/src/BSD_build_files
diff -u -EbwB --strip-trailing-cr src/crm114.sourceforge.net/src/crm114_config.h src/patched/crm114.sourceforge.net/src/crm114_config.h
--- src/crm114.sourceforge.net/src/crm114_config.h  2008-03-25 22:29:16.000000000 +0100
+++ src/patched/crm114.sourceforge.net/src/crm114_config.h  2008-03-26 20:26:56.000000000 +0100
@@ -1,5 +1,5 @@
 //  crm114_config.h  - Controllable Regex Mutilator base config, version X0.1
-//  Copyright 2001-2006 William S. Yerazunis, all rights reserved.
+//  Copyright 2001-2009, Gerrit E.G. Hobbelt (Ger Hobbelt a.k.a. [i_a] ) - William S. Yerazunis, all rights reserved.
 //  
 //  This software is licensed to the public under the Free Software
 //  Foundation's GNU GPL, version 2.  You may obtain a copy of the
 

HINT: Bill, you might consider updating your copyright notice. It's quickly done: grep for 2006 and 2007, write a few lines of sed script and everything extends nicely into 2008.

Yeah, doesn't put a dent in the development priorities. Not important.

Next: braces in macros help

@@ -297,10 +297,10 @@ // defaults to system's, if any
 #ifdef NAME_MAX
-  #define MAX_FILE_NAME_LEN NAME_MAX+1
+#define MAX_FILE_NAME_LEN (NAME_MAX + 1)
 #else
   #ifdef FILENAME_MAX
-    #define MAX_FILE_NAME_LEN FILENAME_MAX+1
+#define MAX_FILE_NAME_LEN (FILENAME_MAX + 1)
   #else
     #define MAX_FILE_NAME_LEN 256
   #endif
 

Imagine using these in code like this:

malloc(sizeof(struct fubar) * MAX_FILE_NAME_LEN)

and wonder why it works on my box but doesn't anywhere else.
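To make the pitfall concrete: without the parentheses, the macro body is pasted in verbatim, so the multiplication grabs only part of it. Here's a compilable toy (the demo macro names are mine, not from the crm114 tree):

#include <stdio.h>

#define NAME_MAX_DEMO 255
#define BAD_LEN  NAME_MAX_DEMO + 1      /* pasted verbatim, unparenthesized */
#define GOOD_LEN (NAME_MAX_DEMO + 1)

int main(void)
{
    /* BAD_LEN:  8 * 255 + 1 = 2041 -- one element short of the intent */
    /* GOOD_LEN: 8 * (255 + 1) = 2048 -- what was actually meant       */
    printf("bad: %d  good: %d\n", 8 * BAD_LEN, 8 * GOOD_LEN);
    return 0;
}

The parentheses in the patched version make the expansion immune to whatever expression happens to surround it.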

Section 2: conditionally compiling certain classifiers [crm_neural_net.c]

diff -u -EbwB --strip-trailing-cr src/crm114.sourceforge.net/src/crm_neural_net.c src/patched/crm114.sourceforge.net/src/crm_neural_net.c
--- src/crm114.sourceforge.net/src/crm_neural_net.c 2008-03-25 12:13:16.000000000 +0100
+++ src/patched/crm114.sourceforge.net/src/crm_neural_net.c 2008-03-27 03:54:21.000000000 +0100
@@ -63,6 +63,15 @@
 //  and include the routine declarations file
 #include "crm114.h"
 
+
+
+
+#if !defined(CRM_WITHOUT_NEURAL_NET)
+
+
+
+
+
 //    the command line argc, argv
 extern int prog_argc;
 extern char **prog_argv;
 

The trick here is to have the .c file load all the [relevant] header files (the system/portability header files come to mind), then check whether this particular classifier is really wanted. Surround the code with this:

#if !defined(CRM_WITHOUT_CLASSIFIER_XXX)

... the code ...

#else

replacement/stub

#endif

where the stub can be another special implementation of the globally available classifier routines in this source file.

These drop-in replacements produce neat error messages reporting the unavailability of said classifier, so nothing fails in silence.

Meanwhile, all the classifier code is gone from the resulting binary, which is also beneficial in terms of intellectual property security: nobody can reverse engineer code out of a binary it was never compiled into.

Some real meat? NN reproducibility for testers and porters

@@ -80,6 +95,17 @@ //static int neural_trace = 1;
 static int neural_trace = 0;
 
+
+
+
+
+// define this only when you're porting this code to another platform (Win32) and need to
+// be DAMN sure the results are reproducable across the various platforms!
+#define DISABLE_RANDOM_NOISE_FOR_REPRO_CHECK 1
+
+
+
+
 #define HEADER_SIZE 1024
 
 #define STOCHASTIC
 

That included #define will turn off the random generator here:

@@ -201,7 +227,11 @@ inline static double rand0to1()
 {
+#if defined(DISABLE_RANDOM_NOISE_FOR_REPRO_CHECK)
+   return 0.5;
+#else
   return ((double)rand() / (double)RAND_MAX);
+#endif
 }
 
 inline static float stochastic_factor (float stoch_noise,
 

so that CRM114 NN classifier behaviour and output can be compared across machines more easily during the testing phase.

NO, not every box has the same random generator, though for specific GCC, glibc and OS combinations one might improve the reproducibility of megatest results by initializing the random generator with a fixed seed value at the start of the application. But that is just a local kludge which won't work across AIX, Win32, or other platforms. If you need the 'random' noise in there while ensuring 100% reproducible output, you should use your own portable random generator code and initialize that generator with a fixed seed at every run. But I digress - don't I always?

A last thought to close the random() subject: system-provided random generators are notorious for their sometimes shoddy output quality. Your own random generator is the only way to go if you need something suitable for simulation or other randomness-requiring tasks - such as working neural nets. Quick fix? Try a Mersenne Twister.
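For the record, a minimal sketch of what such a portable, seedable generator could look like - an xorshift32 of my own choosing, not code from the patch:

#include <stdint.h>

static uint32_t repro_rng_state = 12345u;   /* fixed seed, every run */

static uint32_t repro_rand(void)
{
    uint32_t x = repro_rng_state;           /* plain xorshift32 */
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    return repro_rng_state = x;
}

static double repro_rand0to1(void)
{
    return repro_rand() / 4294967296.0;     /* map to [0, 1) */
}

Same bits on AIX, Win32 or your Linux box, so megatest output stays comparable while the noise remains 'random' enough for training.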

Static, people!

@@ -228,23 +258,22 @@ //  These are used _only_ when a new net needs to be created;
 //  the actual net dimensions used usually come from the file header
-long    retina_size = NN_RETINA_SIZE;
-long    first_layer_size = NN_FIRST_LAYER_SIZE;
-long    hidden_layer_size = NN_HIDDEN_LAYER_SIZE;
+static long    retina_size = NN_RETINA_SIZE;
+static long    first_layer_size = NN_FIRST_LAYER_SIZE;
+static long    hidden_layer_size = NN_HIDDEN_LAYER_SIZE;
 

This diff section is split down the middle; the first half requires some special attention from the vicar: when you decide to use global variables, at least do the other source files (and their interns/implementors/...) a favor and keep your stuff to yourself: 'static' is what you need.

Crustification

And the second half of the diff block?

Just the result of a code reformatter (called uncrustify), which does wonders to make the crm114 code readable. Included so I could plug it.

./configure && make reindent

and you've got it all. (Using the latest uncrustify, of course.)

 

-/*
-
-Another (as yet unimplemented) idea is to stripe the retina in a
-pattern such that each neuron is "up against" every other neuron at
-least once, but not as many times as the current "full crossbar"
-system, which takes a long time to train.
-
-One way to do this is to label the points onto a fully connected graph
-nodes, and then to take the graph nodes in pairwise (or larger) groups
-such as, on edges, faces, or hyperfaces, so that every graph node is
-taken against every other node and hence every input at one level is
-"against" every other input.  However, we haven't implemented this.
 
+/*
+ * Another (as yet unimplemented) idea is to stripe the retina in a
+ * pattern such that each neuron is "up against" every other neuron at
+ * least once, but not as many times as the current "full crossbar"
+ * system, which takes a long time to train.
+ *
+ * One way to do this is to label the points onto a fully connected graph
+ * nodes, and then to take the graph nodes in pairwise (or larger) groups
+ * such as, on edges, faces, or hyperfaces, so that every graph node is
+ * taken against every other node and hence every input at one level is
+ * "against" every other input.  However, we haven't implemented this.
 */
 
 
 

Sqrt() from int at 1K or with fractions?

@@ -265,11 +294,11 @@   if(sparse_spectrum_file_length)
   {
-    first_layer_size = sqrt (sparse_spectrum_file_length / 1024);
+        first_layer_size = (int)sqrt((int)(sparse_spectrum_file_length / 1024));
     if (first_layer_size < 4) first_layer_size = 4;
     hidden_layer_size = first_layer_size * 2 ;
     retina_size = first_layer_size * 1024  ;
-    if (retina_size < 1024) retina_size = 1024;
+        // if (retina_size < 1024) retina_size = 1024;  -- dead code: first_layer_size >= 4 anyway
   }
   if (neural_trace || user_trace )
     fprintf (stderr, "Input sparse_spectrum_file_length = %ld. \n"
 

Should the division inside the sqrt() call be integer-based? It is, both before and after this change. If so, better make it darn clear that it is. And if you don't want that, divide by 1024.0 so the compiler will promote the division to a double-based floating point division instead.
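A quick demonstration of the difference (toy value; link with -lm):

#include <math.h>
#include <stdio.h>

int main(void)
{
    long sparse_spectrum_file_length = 1536;

    /* integer division truncates first: 1536 / 1024 == 1 */
    printf("%f\n", sqrt((double)(sparse_spectrum_file_length / 1024)));  /* 1.000000 */

    /* double division keeps the fraction: 1536 / 1024.0 == 1.5 */
    printf("%f\n", sqrt(sparse_spectrum_file_length / 1024.0));          /* 1.224745 */
    return 0;
}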

The outside '(int)' typecast is there to prevent a compiler warning.

And then there's that last line in the same scope: since first_layer_size is already lower-bounded at 4, retina_size (= first_layer_size * 1024) can never drop below 4096, let alone 1024, so the conditional lower bounding is dead code.

It Just Works Here(tm)? How about error checking?

@@ -283,7 +312,20 @@   h.hidden_layer_size = hidden_layer_size;
   
   //    Write out the header fields
-  fwrite(&h, 1, sizeof(NEURAL_NET_HEAD_STRUCT), f);
+    if (1 != fwrite(&h, sizeof(NEURAL_NET_HEAD_STRUCT), 1, f))
+    {
+#if 0
+        fatalerror_ex(SRC_LOC(),
+            "\n Couldn't write Neural Net header to file %s; errno=%d(%s)\n",
+            filename, errno, errno_descr(errno));
+#else
+        fatalerror(
+            "\n Couldn't write Neural Net header to file ",
+            filename);
+#endif
+        fclose(f);
+        return -1;
+    }
 
   //     Pad out the header space to header_size
   for(i = sizeof(NEURAL_NET_HEAD_STRUCT); i < HEADER_SIZE; i++)
 

I don't know, but I've found the world littered with mouths that extol the virtues of It Works Here(tm), while no brain seems attached. And no, nobody adds the error checking later on. RAD includes the coding practice of writing the error checks while you block out the rest. XP (Extreme Programming) did for unit testing what apparently still needs to be done for error checking in code everywhere. crm114 is no exception here. Sigh. Unfortunately, error checking will never become a hype, as it only means 'extra work' to ensure robustness.

Gas Leaks & Chewing Gum

Imagine your mechanic fixing a gas leak in your car with chewing gum. You'd go spare. Yet back in WW II, quite a few bullet holes and other rips in the gas tanks of planes and cars were [temporarily] fixed using the chewing gum method. Now answer me this riddle: why doesn't anyone appreciate the method today? It works. It's cheap. Right?

Same goes for software [technical] design and implementation.

More error checking coming up

@@ -303,22 +345,74 @@       i--;
       a = rand0to1 () * NN_INITIALIZATION_NOISE_MAGNITUDE;
       fprintf (stderr, "%lf ", a);
-      fwrite(&a, 1, sizeof(float), f);
+        if (1 != fwrite(&a, sizeof(a), 1, f))
+        {
+#if 0
+            fatalerror_ex(SRC_LOC(),
+                "\n Couldn't write Neural Net Coefficients to file %s; errno=%d(%s)\n",
+                filename, errno, errno_descr(errno));
+#else
+            fatalerror(
+                "\n Couldn't write Neural Net Coefficients to file ",
+                filename);
+#endif
+            fclose(f);
+            return -1;
+        }
       i--;
       a = rand0to1 () * NN_INITIALIZATION_NOISE_MAGNITUDE;
       fprintf (stderr, "%lf ", a);
-      fwrite(&a, 1, sizeof(float), f);
+        if (1 != fwrite(&a, sizeof(a), 1, f))
+        {
+#if 0
+            fatalerror_ex(SRC_LOC(),
+                "\n Couldn't write Neural Net Coefficients to file %s; errno=%d(%s)\n",
+                filename, errno, errno_descr(errno));
+#else
+            fatalerror(
+                "\n Couldn't write Neural Net Coefficients to file ",
+                filename);
+#endif
+            fclose(f);
+            return -1;
+        }
       i--;
       a = rand0to1 () * NN_INITIALIZATION_NOISE_MAGNITUDE;
       fprintf (stderr, "%lf ", a);
-      fwrite(&a, 1, sizeof(float), f);
+        if (1 != fwrite(&a, sizeof(a), 1, f))
+        {
+#if 0
+            fatalerror_ex(SRC_LOC(),
+                "\n Couldn't write Neural Net Coefficients to file %s; errno=%d(%s)\n",
+                filename, errno, errno_descr(errno));
+#else
+            fatalerror(
+                "\n Couldn't write Neural Net Coefficients to file ",
+                filename);
+#endif
+            fclose(f);
+            return -1;
+        }
       fprintf (stderr, "\n");
     };
   while(i--)
   {
       a = rand0to1 () * NN_INITIALIZATION_NOISE_MAGNITUDE;
       //      fprintf (stderr, "%lf ", a);
-      fwrite(&a, 1, sizeof(float), f);
+        if (1 != fwrite(&a, sizeof(a), 1, f))
+        {
+#if 0
+            fatalerror_ex(SRC_LOC(),
+                "\n Couldn't write Neural Net Coefficients to file %s; errno=%d(%s)\n",
+                filename, errno, errno_descr(errno));
+#else
+            fatalerror(
+                "\n Couldn't write Neural Net Coefficients to file ",
+                filename);
+#endif
+            fclose(f);
+            return -1;
+        }
   }  
   fclose(f);
   return 0;

Bumper sticker: I love my fatalerror_ex(). Its printf()-like power beats fatalerror() / fatalerror5() / what-have-we-got. All it takes is some good knowledge of the C preprocessor, ellipses and va_arg, and Bob's your uncle.  ;-P
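For the uninitiated, the recipe goes roughly like this - my own bare-bones sketch, not the actual fatalerror_ex() from the GerH tree:

#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>

static void demo_fatalerror_ex(const char *srcfile, int srcline,
                               const char *fmt, ...)
{
    va_list args;

    fprintf(stderr, "%s(%d): FATAL: ", srcfile, srcline);
    va_start(args, fmt);
    vfprintf(stderr, fmt, args);        /* the va_arg magic happens in here */
    va_end(args);
    exit(EXIT_FAILURE);
}

/* a macro supplies the source location for free (cf. SRC_LOC()) */
#define DEMO_FATALERROR(...) \
    demo_fatalerror_ex(__FILE__, __LINE__, __VA_ARGS__)

After which DEMO_FATALERROR("Couldn't write header to %s; errno=%d\n", filename, errno) reads exactly like the printf() you already know.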

Utter crud

@@ -451,11 +545,13 @@   float y;
   if(isnan(a))
     {
+#if 0 // [i_a] gibberish
       char *foo;
       foo = malloc (32);
       fprintf (stderr, "Logistic of a NAN\n");
       foo = NULL;
       strcpy ("hello, world", foo);
+#endif
       fatalerror5("Tried to compute the logistic function of a NAN!!", "",
          CRM_ENGINE_HERE); 
       return 0.5; //perhaps we should escape all the way

Imagine that #if 0-ed piece of code in that error clause were ever executed. Student exercise: list at least two reasons why the strcpy() will fail dramatically, including a certified coredump. Assume a well-behaved OS. Hint: both args are faulty as Hell.

64-bit is not 'long'

@@ -653,9 +749,9 @@   int unigram, unique, string;
   long next_offset;
    
-  unique = apb->sflags & CRM_UNIQUE;
-  unigram = apb->sflags & CRM_UNIGRAM;
-  string = apb->sflags & CRM_STRING;
+    unique = !!(apb->sflags & CRM_UNIQUE); /* convert 64-bit flag to boolean */
+    unigram = !!(apb->sflags & CRM_UNIGRAM);
+    string = !!(apb->sflags & CRM_STRING);
   *sum = 0;
   *ate = 0;
 

Now apb->sflags truly is a 64-bit type, for it would otherwise be incapable of storing all those flag bits.

Second piece of news: you do not know which bit is used by which flag, so IWH ('It Works Here') is no excuse.

Of course, more expressive persons might have written it up as

resulting_bool = (0 != (apb->sflags & CRM_FLAG));

but a basic double negative, like the '!' boolean not operator applied twice, accomplishes exactly the same in much less code. There are people out there who might want to convince you that '!!' is the trademark of a certain individual touching your delicate lines of code, though. You might agree with them, but there's no escape: assigning the raw result of the bitwise & to an int (or even that bludgeoned-to-death 'long', which is 4 bytes on many machines anyway) silently drops the high bits, so flags living above bit 31 will always test as zero. Listen to the compiler warnings here.
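Seeing is believing - a small compilable demonstration (the flag name is invented; any flag above bit 31 will do):

#include <stdint.h>
#include <stdio.h>

#define CRM_DEMO_FLAG (1ULL << 40)      /* a flag living above bit 31 */

int main(void)
{
    uint64_t sflags = CRM_DEMO_FLAG;

    /* truncated to 32 bits on assignment: 0 (and a compiler warning) */
    int broken = sflags & CRM_DEMO_FLAG;
    /* proper boolean conversion: 1 */
    int fixed  = !!(sflags & CRM_DEMO_FLAG);

    printf("broken = %d, fixed = %d\n", broken, fixed);
    return 0;
}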

I like the '!!'. It's Triple C or maybe even Quad C: CCC = Concise, Cool and Creative. Maybe even Catchy. There's been an argument about KISS over this. "We don't babble baby talk just because you didn't properly learn to speak a language either. Learning the language is what you need to communicate." (The quote is not mine, unfortunately. :-) )

Unused variables. 'Nuff said.

  
@@ -775,11 +871,11 @@   long i, j, n_features;
   long old_file_size;
-  unsigned long *new_doc_start, *current_doc, *k, *l; 
+  unsigned long *current_doc, *k, *l; 
   long found_duplicate, n_docs, n_docs_trained, out_of_class, current_doc_len;
-  long n_cycles, filesize_on_disk, soft_cycle_limit;
+  long n_cycles, soft_cycle_limit;
 
-  FILE *f;
+    // FILE *f;
   
   float *dWin, *dWhid, *dWout;
   

Some of these are due to the additional removal of dead / unused code below.

The dreaded 0x021

@@ -812,10 +908,34 @@   htext_len = apb->p1len;
   htext_len = crm_nexpandvar (htext, htext_len, MAX_PATTERN);
   
+#if 0
   i = 0;
-  while (htext[i] < 0x021) i++;
+    while (htext[i] < 0x021)
+        i++;
+//    CRM_ASSERT(i < htext_len);
   j = i;
-  while (htext[j] >= 0x021) j++;
+    while (htext[j] >= 0x021)
+        j++;
+//    CRM_ASSERT(j <= htext_len);
+#else
+    if (!crm_nextword(htext, htext_len, 0, &i, &j) || j == 0)
+    {
+#if 0
+        int fev = nonfatalerror_ex(SRC_LOC(),
+            "\nYou didn't specify a valid filename: '%.*s'\n",
+            (int)htext_len,
+            htext);
+#else
+        int fev = nonfatalerror(
+            "\nYou didn't specify a valid filename: ",
+            htext);
+#endif
+        return fev;
+    }
+    j += i;
+//    CRM_ASSERT(i < htext_len);
+//    CRM_ASSERT(j <= htext_len);
+#endif
   htext[j] = '\000';
   strcpy (filename, &htext[i]);
 

The #else branch of the #if 0 shows the alternative: crm_nextword() has it all. That also gives the software a single point where this type of 'range checking/skipping' is done.

Again, note the inclusion of some proper(?) error handling too. Empty filenames will fail further down in the code for other reasons, but that's plain sloppy, and hard for the user to comprehend when the filename on display is, well, nil.

sscanf() can be checked - plus a reorder of boundary checks

@@ -846,18 +966,24 @@
     s2_len = apb->s2len;
     s2_len = crm_nexpandvar (s2_buf, s2_len, MAX_PATTERN);
     if (s2_len > 0)
-      sscanf (s2_buf, "%lf %f %ld %f", 
+        {
+            if (4 != sscanf(s2_buf, "%lf %f %ld %f",
          &alpha, 
          &stoch_noise, 
          &soft_cycle_limit, 
-         &internal_training_threshold);
+                    &internal_training_threshold))
+            {
+                nonfatalerror("Failed to decode the 4 Neural Net setup parameters [learn]: ", s2_buf);
+            }
+            if (alpha < 0.0) alpha = NN_DEFAULT_ALPHA; /* [i_a] moved up here by 3 lines, or it would've been useless */
     if (alpha < 0.000001 ) alpha = 0.000001;
     if (alpha > 1.0) alpha = 1.0;
-    if (alpha < 0.0 ) alpha = NN_DEFAULT_ALPHA;
     if (stoch_noise > 1.0) stoch_noise = 1.0;
-    if (stoch_noise < 0) stoch_noise = NN_DEFAULT_STOCH_NOISE;
+            if (stoch_noise < 0.0) stoch_noise = NN_DEFAULT_STOCH_NOISE;
     if (soft_cycle_limit < 0) 
       soft_cycle_limit = NN_MAX_TRAINING_CYCLES_FROMSTART;
+            if (internal_training_threshold < 0.0) internal_training_threshold = NN_INTERNAL_TRAINING_THRESHOLD;
+        }
 
     //    fprintf (stderr, "**Alpha = %lf, noise = %f cycle limit = %ld\n", 
     //     alpha, stoch_noise, soft_cycle_limit);

sscanf() returns the number of successfully converted arguments. Nice feature - if you like error checks and failure reports (fatalerror.+()).
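Distilled to its essence (a toy buffer of mine with one parameter missing, so the check trips):

#include <stdio.h>

int main(void)
{
    const char *s2_buf = "0.1 0.5 5000";    /* only 3 of the 4 params */
    double alpha;
    float stoch_noise, threshold;
    long soft_cycle_limit;

    if (4 != sscanf(s2_buf, "%lf %f %ld %f",
                    &alpha, &stoch_noise, &soft_cycle_limit, &threshold))
    {
        fprintf(stderr, "Failed to decode the 4 setup parameters: %s\n", s2_buf);
        return 1;
    }
    return 0;
}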

Also note that the third alpha check was rather useless where it stood: by the time it ran, the 0.000001 lower bound had already clamped any negative value. So I guesstimated it should be moved to the top.

And the fourth parameter deserves its own boundary check too, to keep the coding behaviour consistent here.

Error check and size typecast for 64-bit boxxen

@@ -890,10 +1016,13 @@
       //   Nope, file does not exist.  Make a new one.
       if (user_trace)
    fprintf (stderr, "Making a new backing file: '%s'\n", filename);
-      make_new_backing_file(filename);
+        if (make_new_backing_file(filename))
+        {
+            return -1;
+        }
       stat(filename, &statbuf);
       if (neural_trace)
-   fprintf (stderr, "Initial file size: %ld\n", statbuf.st_size);
+            fprintf(stderr, "Initial file size: %ld\n", (long int)statbuf.st_size);
     }
 
   //   The file now ought to exist.  Map it.

Suggestion for (small?) speedup in Neural Net learn

@@ -918,7 +1047,8 @@
       k < nn->docs_end 
    && (k[0] = 0 || k[0] == 1)  //   Find a start-of-doc sentinel
    && k[2] != sum;             //  does the checksum match?
-      k++);
+         k++) 
+            ;  // TBD: speed up by jumping over trains by using length in header k[0] ?
   
   if (k[2] == sum)
     found_duplicate = 1;

Several of these 'scan all elements while looking for k[i] == 0 or 1' loops can be sped up by storing the data in a slightly different format. The idea is to store the length of each document in the file too. For the details, let's skip ahead to another diff, where the bit-trickery is explained in some detail in the comments:

@@ -1010,37 +1202,91 @@
      //     Offset 0 is the sentinel - always either 0 or 1
      //   Offset 0 -  Write 0 or 1 (in-class or out-class)
-     fwrite(&out_of_class, 1, sizeof(long), f);
+
+                val = out_of_class;
+               // [i_a] TBD: do this instead to provide a 'jump' offset to the next doc when scanning the CSS file lateron:
+               //            
+                // val = out_of_class | (n_features * sizeof(bag[0]) + 3 * sizeof(val));
+               //
+               // where
+//             CRM_ASSERT(sizeof(bag[0]) == sizeof(val));
+               //
+               // The above works as a byte offset and can be used to jump to the next doc, by using it this way:
+               //
+               // check in/out class by looking at bit 0:
+               //
+               //   if ((k[0] & 0x01) is 0 or 1: in-class vs. out-class) do ...
+               //
+               // jump to start of next doc:
+               //
+               //   k += k[0] / sizeof(k[0]);
+               //
+               // computer-savvy people will of course rather use this instead:
+               //
+               //   k += k[0] >> log2 of sizeof(k[0]);
+               //
+               // but since that is not easily calculated at compile-time (though a nice macro comes to mind... :-) )
+               // you can also do this here and there, as it's only bit 0 we're interested in for checking.
+               //
+                //   val = out_of_class | ((n_features + 3) << 1);
+               //
+               //   if ((k[0] & 0x01) is 0 or 1: in-class vs. out-class) do ...
+               //
+               //   k += k[0] >> 1;
+               //
+               // will work guaranteed as the CSS file size cannot be larger than MAX_UINT bytes anyway or the complete
+               // system will barf on the mmap() - which will fail somewhere before that distant limit anyhow.
+               // And since we basically store the byte-offset to the next doc, but use the fact that bit0 will always
+               // be zero for such an offset (as long as the element size is a power of 2 >= 1 :-) )
+               //
+               //
+               // Savings? Several for-loops don't need to scan all elements, but can very quickly hop to the proper
+               // matching entry.
+               // Compared to the cost of the backprop in learn, dunno how much this will bring us, but at least it's
+               // an easy optimization anyhow.
+               //
+
+                retv += fwrite(&val, sizeof(val), 1, f);    //  in-class(0) or out-of-class(1)
      

Additional points of interest for when this change gets implemented are marked above, and at a few more spots further below.
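Condensed into code, the scan the comment proposes would look something like this. A sketch only: the k[0] encoding (element count to the next doc shifted left one bit, class flag in bit 0) is taken straight from the comment above, not from shipping code:

#include <stdio.h>

static void scan_docs(unsigned long *docs_start, unsigned long *docs_end)
{
    unsigned long *k;

    for (k = docs_start; k < docs_end; k += k[0] >> 1)   /* hop, don't crawl */
    {
        int out_of_class = (int)(k[0] & 0x01);   /* bit 0: in/out of class */

        /* k[1] = no-train count, k[2] = doc checksum, k[3..] = features */
        printf("doc at %p out_of_class %d checksum %08lx\n",
               (void *)k, out_of_class, k[2]);
    }
}

int main(void)
{
    /* two fake docs: 3 header slots + 2 features each, so k[0] = (5 << 1) | class */
    unsigned long docs[10] = {
        (5 << 1) | 0, 0, 0x1234, 111, 222,    /* in-class doc     */
        (5 << 1) | 1, 0, 0x5678, 333, 444,    /* out-of-class doc */
    };

    scan_docs(docs, docs + 10);
    return 0;
}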

Pointer subtraction

@@ -938,19 +1068,22 @@
      //   input set.   Cut it out of the 
      //   knowledge base.
      unsigned long *dest;      
-     long len;
+            int len;
      if (neural_trace)
        fprintf (stderr, "Obliterating data...");
      
      dest = k;
      //   Find the next zero sentinel
-     for(k += 3; *k; k++);
+            for (k += 3; *k; k++)
+                ;
      //   and copy from here to end of file over our
      //    incorrectly attributed data, overwriting it.
-     len = (unsigned long) k - (unsigned long) dest;
+            len = (int)((k - dest) * sizeof(k[0]));
      if (neural_trace)
-       fprintf (stderr, "start %lx length %ld\n",
-            (unsigned long) dest, len);
+            {
+                fprintf(stderr, "start %p length %d\n",
+                    dest, len);
+            }
      for (; k< nn->docs_end; k++)
        {
          *dest = *k;

Casting pointers to some sort of 'long' and then subtracting them to get at the number of bytes between them has been 'not done' since Queen Victoria.

Pointer subtraction produces the distance as a number of elements, so all we need to do is multiply that number by the size of one (1) element, in a portable fashion. Use sizeof(variable) instead of sizeof(type) so you don't risk trouble when you (or someone else) changes the variable's type: a variant on the 'single point of change' principle - change the variable's type and everything changes with it. No need to scan the complete code for those elusive lurking sizeof(type)s that will hurt when you port or upgrade the code.
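In sketch form, the difference between the element count and the byte count (LP64 assumed in the comments):

#include <stdio.h>

int main(void)
{
    unsigned long buf[16];
    unsigned long *dest = &buf[2];
    unsigned long *k = &buf[10];

    long elems = (long)(k - dest);                   /* 8 elements */
    long bytes = (long)((k - dest) * sizeof(k[0]));  /* 64 bytes on LP64 */

    printf("%ld elements, %ld bytes\n", elems, bytes);
    return 0;
}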

Error checking continued - fatalerror_ex() galore. ... And The Printf("%p")

@@ -960,11 +1093,31 @@
      //    now unmap the file, msync it, unmap it, and truncate
      //    it.  Then remap it back in; viola- the bad data's gone.
      force_unmap_file (nn, filename);
-     truncate (filename, statbuf.st_size - len);
-     map_file (nn, filename);
+           i = truncate(filename, statbuf.st_size - len);
+           if (i)
+           {
+#if 0
+               fatalerror_ex(SRC_LOC(), "Failed to truncate the Neural Net CSS 'learn' file '%s': error %d(%s)\n",
+                    filename,
+                   errno,
+                   errno_descr(errno));
+#else
+               fatalerror("Failed to truncate the Neural Net CSS 'learn' file ",
+                    filename);
+#endif
+               return -1;
+           }
+            
+           i = map_file(nn, filename);
+           if (i < 0)
+           {
+               fatalerror("Could not create the neural net backing file ",
+                   filename);
+               return -1;
+           }
      if (neural_trace)
-       fprintf (stderr, "Dealt with a duplicate at %lx\n", 
-            (unsigned long) k);
+                fprintf(stderr, "Dealt with a duplicate at %p\n",
+                    k);
    }
       else
    {

Only one thing begs to be mentioned: since the dawn of time, printf() has provided the '%p' format to display pointers. Those buggers come in all shapes and sizes, but %p takes care of them all.

Don't try to scanf("%p") though, because you'll get burned for a slew of reasons.

Just another comment

@@ -972,6 +1125,11 @@
        fprintf(stderr, "***Learning same file twice.  W.T.F. \n");
    };
     };
+
+   //   [i_a] GROT GROT GROT a risky statement, though elegant in a way:
+   //         update the dafault NN file size setting for the remainder of
+   //         the run-time duration (OR when another NN CSS file was learned
+   //         as that would pass through here again).
   retina_size = nn->retina_size;
 
   //   then this data file has never been seen before so append it and remap

An integer smaller than 0.5?

@@ -985,7 +1143,7 @@        old_file_size);
 
       //   make sure that there's something to add.
-      if(n_features < 0.5 )
+        if (n_features < 0.5) /* [i_a] huh? n_features is an int/long! */
    {
      if(neural_trace)
        fprintf(stderr, 

n_features is a 'long' integer. Of course the comparison is valid (it amounts to n_features <= 0), but it makes me wonder.

Variables in local scope, the double boolean negative for 64-bit flags again and extensive file I/O error checking

@@ -993,12 +1151,46 @@    }
       else
    {
+            FILE *f;
+
 
      // unmap the file so we can write-append to it.
      force_unmap_file(nn, filename);
      f = fopen(filename, "ab+");
-     out_of_class = (apb->sflags & CRM_REFUTE);
-     if (out_of_class != 0) out_of_class = 1;   // change bitpat to 1 bit
+            if (f == NULL)
+            {
+#if 0
+                int fev = fatalerror_ex(SRC_LOC(),
+                    "\n Couldn't open your new CSS file %s for append; errno=%d(%s)\n",
+                    filename,
+                    errno,
+                    errno_descr(errno));
+#else
+                int fev = fatalerror(
+                    "\n Couldn't open your new CSS file for append; \n",
+                    filename);
+#endif
+                return fev;
+            }
+            else
+            {
+                long /* crmhash_t */ val;
+                int retv;
+
+                //     And make sure the file pointer is at EOF.
+                (void)fseek(f, 0, SEEK_END);
+
+                retv = 0;
+                if (ftell(f) <= HEADER_SIZE)
+                {
+                   fatalerror("Neural Net CSS store seems to be corrupt. Filename: ",
+                            filename);
+                        fclose(f);
+                        return -1;
+                }
+
+                out_of_class = !!(apb->sflags & CRM_REFUTE);   // change bitpat to 1 bit
+//   if (out_of_class != 0) out_of_class = 1;   // change bitpat to 1 bit
      if(internal_trace || neural_trace)
        {
          fprintf (stderr, "Sense writing %ld\n", out_of_class);

See above for a discussion of the '!!' double unary boolean not operator and its effect on 64-bit flag values being converted to int-sized values. No, that 'if (out_of_class != 0)' doesn't cut it, as it comes too late: the high bits of the bitwise & output were already lost in the assignment.

For cross-platform portability's sake, 'val' should be something of defined bit-width, such as crmhash_t, a custom type typedef'ed to 'uint32_t'. Meanwhile, the code is prepared for that portability by making sure one type (and hence one variable) is used to write the individual elements to file.

And then there's the file I/O checking: are we positioned at the end of the file (ftell())? As usual, every system/runtime library call should be error checked as well. It all starts with a check for fopen(file, "ab+") errors and goes on from there. Did I just hear you say "paranoid"? You bet.

fwrite() file I/O result checking: did we get everything to disk?

More file I/O (fwrite()!) error checking below, obscured by the huge 'TBD = To Be Done' comment describing an idea I had while trying to comprehend the NN code. Note that the fwrite()s are used such that each returns the number of successfully written elements, which we accumulate and can easily validate at the end: if (retv != 1 + 1 + 1 + n_features), then barf.
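The counting pattern, distilled into a toy function of mine ('hdr3', 'bag' and 'n_features' stand in for the patch's variables):

#include <stdio.h>

static int write_doc(FILE *f, const long *hdr3, const long *bag, long n_features)
{
    size_t retv = 0;

    retv += fwrite(&hdr3[0], sizeof(hdr3[0]), 1, f);     /* class flag     */
    retv += fwrite(&hdr3[1], sizeof(hdr3[1]), 1, f);     /* no-train count */
    retv += fwrite(&hdr3[2], sizeof(hdr3[2]), 1, f);     /* doc checksum   */
    retv += fwrite(bag, sizeof(bag[0]), n_features, f);  /* the features   */

    /* every element accounted for, or we barf */
    return (retv == 1 + 1 + 1 + (size_t)n_features) ? 0 : -1;
}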

@@ -1010,37 +1202,91 @@
      //     Offset 0 is the sentinel - always either 0 or 1
      //   Offset 0 -  Write 0 or 1 (in-class or out-class)
-     fwrite(&out_of_class, 1, sizeof(long), f);
+
+                val = out_of_class;
+               // [i_a] TBD: do this instead to provide a 'jump' offset to the next doc when scanning the CSS file lateron:
+               //            
+                // val = out_of_class | (n_features * sizeof(bag[0]) + 3 * sizeof(val));
+               //
+               // where
+//             CRM_ASSERT(sizeof(bag[0]) == sizeof(val));
+               //
+               // The above works as a byte offset and can be used to jump to the next doc, by using it this way:
+               //
+               // check in/out class by looking at bit 0:
+               //
+               //   if ((k[0] & 0x01) is 0 or 1: in-class vs. out-class) do ...
+               //
+               // jump to start of next doc:
+               //
+               //   k += k[0] / sizeof(k[0]);
+               //
+               // computer-savvy people will of course rather use this instead:
+               //
+               //   k += k[0] >> log2 of sizeof(k[0]);
+               //
+               // but since that is not easily calculated at compile-time (though a nice macro comes to mind... :-) )
+               // you can also do this here and there, as it's only bit 0 we're interested in for checking.
+               //
+                //   val = out_of_class | ((n_features + 3) << 1);
+               //
+               //   if ((k[0] & 0x01) is 0 or 1: in-class vs. out-class) do ...
+               //
+               //   k += k[0] >> 1;
+               //
+               // will work guaranteed as the CSS file size cannot be larger than MAX_UINT bytes anyway or the complete
+               // system will barf on the mmap() - which will fail somewhere before that distant limit anyhow.
+               // And since we basically store the byte-offset to the next doc, but use the fact that bit0 will always
+               // be zero for such an offset (as long as the element size is a power of 2 >= 1 :-) )
+               //
+               //
+               // Savings? Several for-loops don't need to scan all elements, but can very quickly hop to the proper
+               // matching entry.
+               // Compared to the cost of the backprop in learn, dunno how much this will bring us, but at least it's
+               // an easy optimization anyhow.
+               //
+
+                retv += fwrite(&val, sizeof(val), 1, f);    //  in-class(0) or out-of-class(1)
      
      //  Offset 1 - Write the number of passes without a 
      //  retrain on this one
      i = 0; //number of times we went without training this guy
-     fwrite(&i, 1, sizeof(long), f);
+                val = i;
+                retv += fwrite(&val, sizeof(val), 1, f);    // number of times notrain
      
      //  Offset 2 - Write the sum of the document feature hashes.
-     fwrite(&sum, 1, sizeof(long), f);   //hash of whole doc
+                val = sum;
+                retv += fwrite(&val, sizeof(val), 1, f);    // hash of whole doc- for fastfind
      
      //       ALERT ALERT ALERT
      //    CHANGED to save the bag, _not_ the projection!
      //    This means we must project during training, however
      //    it also means we don't bust the CPU cache nearly as 
      //    badly!   
-     fwrite (bag, n_features, sizeof(long), f);
-     if (neural_trace)
+                retv += fwrite(bag, sizeof(bag[0]), n_features, f); // the actual data
+                if (internal_trace)
        { 
          fprintf (stderr, 
               "Appending 3 marker longs and %ld longs of features\n",
               n_features);
          fprintf (stderr, "First three features are %ld %ld %ld\n",
               bag[0], bag[1], bag[2]);
+                    if (retv != 1 + 1 + 1 + n_features)
+                    {
+                        fatalerror("Couldn't write the solution to the .hypsvm file named ",
+                            filename);
+                        fclose(f);
+                        return -1;
+                    }
        }
      //    Close the file and we've now updated the disk image.
      fclose(f);
    }
+        }
+
       stat(filename, &statbuf);
-      filesize_on_disk = statbuf.st_size;
       if(neural_trace)
-        fprintf(stderr, "NN: statted filesize is now %ld.\n", statbuf.st_size);
+            fprintf(stderr, "NN: statted filesize is now %ld.\n", (long int)statbuf.st_size);
       
       //   MMap the file back into memory and fix up the nn->blah structs
       if(map_file(nn, filename))

printf("%p") and a bit of unused code

Note the '%p's everywhere in those printf() statements.

And as a side note (grep for it): new_doc_start is initialized here, but never ever used:

@@ -1050,34 +1296,37 @@
    }
 
       if (neural_trace)
-   fprintf (stderr, "Neural network mapped at %lx\n", (long) nn);
+            fprintf(stderr, "Neural network mapped at %p\n", nn);
 
-      new_doc_start = 
-   (unsigned long *) ((char *)(nn->file_origin) + old_file_size);
+        // CRM_ASSERT(old_file_size % sizeof(crmhash_t) == 0);        [i_a] will NOT work for VERSIONed CSS files
+        //new_doc_start =
+        //    (crmhash_t *)((char *)(nn->file_origin) + old_file_size);    --- unused code
 
       //    Print out a telltale to see if we're in the right place.
       if (neural_trace)
    { 
      fprintf (stderr, "\nAfter mmap, offsets of weight arrays"
-          " are:\n   Win: %lx, Whid: %lx, Wout: %lx\n",
-          (long)nn->Win, (long)nn->Whid, (long)nn->Wout);
+                            " are:\n   Win: %p, Whid: %p, Wout: %p\n",
+                nn->Win, nn->Whid, nn->Wout);
      fprintf (stderr, "First weights (in/hid/out): %lf, %lf, %lf\n", 
           nn->Win[0], nn->Whid[0], nn->Wout[0]);
      fprintf (stderr, "Weight ptrs from map start : "
-          "Win: %lx, Whid: %lx, Wout: %lx\n",
-          (long) ((void *)&nn->Win[0]),
-              (long) ((void *)&nn->Whid[0]),
-          (long) ((void *)&nn->Wout[0]));
+                            "Win: %p, Whid: %p, Wout: %p\n",
+                &nn->Win[0],
+                &nn->Whid[0],
+                &nn->Wout[0]);
      fprintf (stderr, "Weight ptrs (arefWxxx funcs: "
-          "Win: %lx, Whid: %lx, Wout: %lx\n",
-          (long) ((void *)arefWin(nn, 0, 0)),
-          (long) ((void *)arefWhid(nn, 0, 0)),
-          (long) ((void *)arefWout(nn, 0, 0)));
+                            "Win: %p, Whid: %p, Wout: %p\n",
+                arefWin(nn, 0, 0),
+                arefWhid(nn, 0, 0),
+                arefWout(nn, 0, 0));
    };
 
     }
   else
-    new_doc_start = k;
+   {
+        // new_doc_start = k;   [i_a] unused
+   }
   
   //    If we're in APPEND mode, we don't train yet.  So we're
   //    complete at this point and can return to the caller.

A marker for the speed improvement idea (#1)

@@ -1102,6 +1351,7 @@
     {
       if (*k < 2)
       n_docs++;
+       // [i_a] TBD   speed up and improve by providing jump distance in k[0] ?
     };
 
   n_cycles = 0;

If we are going to do the speed enhancement as described above, this is one of the three to four places which merit attention (and code change).

"%p" again

@@ -1237,8 +1487,8 @@
      if (neural_trace)
        {
-         fprintf (stderr, " Doc %lx out_of_class %ld \n", 
-              (unsigned long int) k, 
+                fprintf(stderr, " Doc %p out_of_class %ld\n",
+                    k,
               out_of_class);
          fprintf (stderr, 
               "First weights (in/hid/out): %lf, %lf, %lf\n", 

A marker for the speed improvement idea (#2)

... and just a few more "%p" spots, thrown in for good measure.

@@ -1262,24 +1512,28 @@
        {
          k++;
          current_doc_len++;
+               // [i_a] TBD   speed up and improve by providing jump distance in k[0] ? Be aware of the previous increments of k then; we're beyond the k[0]/k[1]/k[2] header here.
        };
      //    k now points to the next document
      if (neural_trace)
+            {
        fprintf (stderr, 
-            "Current doc %lx class %ld len %lx next doc start %lx next sense %lx\n",
-            (unsigned long) current_doc,
-            (unsigned long) out_of_class,
-            (unsigned long) current_doc_len,
-            (unsigned long) k,
-            (unsigned long) *k);
+                    "Current doc %p class %ld len %ld next doc start %p next sense %08lx\n",
+                    current_doc,
+                    out_of_class,
+                    current_doc_len,
+                    k,
+                    (unsigned long int)*k);
+            }
 
          //      Run the net on the document
      do_net(nn, current_doc, current_doc_len);
      if( neural_trace )
-       fprintf(stderr, "-- doc @ %lx got network outputs of %f v. %f\n",
-           (unsigned long)current_doc,
+            {
+                fprintf(stderr, "-- doc @ %p got network outputs of %f v. %f\n",
+                    current_doc,
            nn->output_layer[0], nn->output_layer[1]);
-
+            }
      //       Keep track of the total error of this epoch.
      errmag = (! (out_of_class)) ?
        nn->output_layer[1] + (1 - nn->output_layer[0])

Code readability; just another %p

Using braces after if()s doesn't hurt either, even when you know those if()s won't ever hold more than a single statement.

I didn't bother to 'fix' this everywhere: uncrustify can be configured to do just that (among other things).

Of course, the actual change here is just the "%p" in the fprintf(), but what the hey. It's 06:30 AM and about bloody time for another micro-rant for my dearly beloved, right?

@@ -1361,8 +1615,11 @@
    //   net got it wrong, or because we're in TRAIN_ALWAYS.  
    //   So, it's time to run the ugly backprop.
    //
-   if(neural_trace) fprintf(stderr, " Need backprop on doc at %lx\n", 
-                (long) current_doc );
+   if (neural_trace)
+                {
+                    fprintf(stderr, " Need backprop on doc at %p\n",
+                        current_doc);
+                }
 
        //       Start setting the errors.  Work backwards, from the
    //       known desired states.

Ah, guv', can ye spare a few %p for me?

@@ -1603,14 +1860,14 @@
      else
        {
          if(neural_trace)
-       fprintf(stderr, "Doc %lx / %ld OK, looks good, not trained\n",
-           (long unsigned) current_doc, out_of_class);       
+                   fprintf(stderr, "Doc %p / %ld OK, looks good, not trained\n",
+                        current_doc, out_of_class);
        };      //      END OF THE PER-DOCUMENT LOOP
 
      if (neural_trace)
        fprintf (stderr, 
-            "Moving to next document at k=%lx to docs_end= %lx\n",
-            (long int)k, (long int)nn->docs_end);
+                    "Moving to next document at k=%p to docs_end=%p\n",
+                    k, nn->docs_end);
    };
       n_cycles++;
       if (neural_trace)

Bitwise is not boolean

@@ -1683,7 +1940,7 @@
   //    Do we microgroom?  
   //    GROT GROT GROT don't do this yet - this code isn't right.
   //    GROT GROT GROT
-  if(apb->sflags & CRM_MICROGROOM & 0)
+    if (apb->sflags & CRM_MICROGROOM && 0) /* [i_a] */
     {
       int trunced = 0;
       for(k = nn->docs_start; k < nn->docs_end;)

Some guys have all the luck. Yes, the old one works too, but better make it a habit to get repetitive on your & or | keys when you want to quickly disable a code section. Better yet: spend a few quid more in typing effort and surround it in

#if 0
...
#endif

for some unmundane pleasure.

The error check love affair - part deux

@@ -1704,7 +1961,21 @@
       if(trunced)
    {
      crm_force_munmap_filename(filename);
-     truncate(filename, trunced);
+           i = truncate(filename, trunced);
+           if (i)
+           {
+#if 0
+               fatalerror_ex(SRC_LOC(), "Failed to truncate the Neural Net CSS 'learn' file '%s' during microgroom: error %d(%s)\n",
+                    filename,
+                   errno,
+                   errno_descr(errno));
+#else
+               fatalerror("Failed to truncate the Neural Net CSS 'learn' file during microgroom: ",
+                    filename);
+#endif
+               return -1;
+           }
+
      if(neural_trace)
        fprintf(stderr, "\nleaving neural net learn after truncating"
            ", n_docs = %ld\n", n_docs);

Lovely fatalerror_ex() there, don't you think?

Bill recently said something about comments being out of sync

@@ -1872,7 +2143,8 @@   suc_pR = get_pR(suc_p);
   out_pos = 0;
   
-  if(suc_p > 0.5 ) //test for nan as well
+    //test for nan as well ??? where
+    if (suc_p > 0.5)
     out_pos += sprintf
       ( outbuf + out_pos,
    "CLASSIFY succeeds; success probability: %f  pR: %6.4f\n",

Better mortals would have written WTF. Is this comment about NaN checking alluding to some as yet undisclosed additional check for NaN pR values (yay!), or ... ?

And some goodness at the end: removing classifiers from the code at compile time: part 2

@@ -1911,3 +2183,30 @@     }
   return 0;
 }
+
+#else /* CRM_WITHOUT_NEURAL_NET */
+
+int crm_neural_net_learn(CSL_CELL *csl, ARGPARSE_BLOCK *apb,
+                         char *txtptr, int txtstart, int txtlen)
+{
+    return nonfatalerror_ex(SRC_LOC(),
+        "ERROR: the %s classifier has not been incorporated in this CRM114 build.\n"
+        "You may want to run 'crm -v' to see which classifiers are available.\n",
+        "Neural Net");
+}
+
+
+int crm_neural_net_classify(CSL_CELL *csl, ARGPARSE_BLOCK *apb,
+                            char *txtptr, int txtstart, int txtlen)
+{
+    return nonfatalerror_ex(SRC_LOC(),
+        "ERROR: the %s classifier has not been incorporated in this CRM114 build.\n"
+        "You may want to run 'crm -v' to see which classifiers are available.\n",
+        "Neural Net");
+}
+
+#endif /* CRM_WITHOUT_NEURAL_NET */
+
+
+
+

Remember that CRM_WITHOUT_xxxx preprocessor check at the start of the code? Here's the closure for it: just an example of how the GerH builds provide error-reporting stubs for compile-time disabled classifiers.

All this for the purpose of not letting any crm script fail silently.

Section 3: [crm_osbf_bayes.c] a piece of cake

diff -u -EbwB --strip-trailing-cr src/crm114.sourceforge.net/src/crm_osbf_bayes.c src/patched/crm114.sourceforge.net/src/crm_osbf_bayes.c
--- src/crm114.sourceforge.net/src/crm_osbf_bayes.c 2007-11-26 13:10:04.000000000 +0100
+++ src/patched/crm114.sourceforge.net/src/crm_osbf_bayes.c 2008-03-26 20:26:56.000000000 +0100
@@ -453,7 +453,7 @@      memmove (tempbuf, ts.ptok, ts.toklen);
      tempbuf[ts.toklen] = '\000';
      fprintf (stderr,
-          "  Learn #%ld t.o. %ld strt %d end %d len %lu is -%s-\n",
+          "  Learn #%ld t.o. %ld strt %ld end %ld len %lu is -%s-\n",
           i,
           textoffset,
           ts.ptok - (unsigned char *) &(txtptr[textoffset]),
@@ -1085,7 +1085,7 @@      memmove (tempbuf, ts.ptok, ts.toklen);
      tempbuf[ts.toklen] = '\000';
      fprintf (stderr,
-          "  Classify #%ld t.o. %ld strt %d end %d len %lu is -%s-\n",
+          "  Classify #%ld t.o. %ld strt %ld end %ld len %lu is -%s-\n",
           i,
           textoffset,
           ts.ptok -

Just a fix for a few compiler warnings on 64-bit boxes.

Section 4: Beam me up, Scotty! [crm_osb_hyperspace.c]

        
diff -u -EbwB --strip-trailing-cr src/crm114.sourceforge.net/src/crm_osb_hyperspace.c src/patched/crm114.sourceforge.net/src/crm_osb_hyperspace.c
--- src/crm114.sourceforge.net/src/crm_osb_hyperspace.c 2008-03-26 03:02:13.000000000 +0100
+++ src/patched/crm114.sourceforge.net/src/crm_osb_hyperspace.c 2008-03-27 03:20:00.000000000 +0100

USE_FIXED_UNIQUE_MODE? Two Hyperspace results for the price of one

@@ -42,6 +42,20 @@
#endif
 
 
+
+
+#if !defined (CRM_WITHOUT_OSB_HYPERSPACE)
+
+
+
+
+
+#undef USE_FIXED_UNIQUE_MODE 
+
+
+
+
+
 //////////////////////////////////////////////////////////////////
 //
 //     Hyperspatial Classifiers

And there's CRM_WITHOUT_xxxxx conditional compile of a classifier again. Sigh.

But wait! What's USE_FIXED_UNIQUE_MODE doing there? Simple: #undef'ed, it means the old 'unique' handling code below remains as it is, so that you can verify that the results of the classifier stay the same.

Next, #define this one to, say, 1, like this:

#define USE_FIXED_UNIQUE_MODE 1

and recompile - after making sure the related code sections are hunky dory. Answers from a different universe result: counts are ever so slightly off, hitcounts change more or less, and your pR values are 'close', but no cigar.

Warning: stupid glitch of yours truly

Most compilers don't yak about the code using the USE_FIXED_UNIQUE_MODE define this way, but I made a stupid error myself by writing it as

#if USE_FIXED_UNIQUE_MODE
    ...

while handcoding the above #undef into the separate source tree copy where I have been merging several of my edits with Bill's original tree (the wget mirror of crm114.sourceforge.net).

So if your compiler is a little more strict/pedantic than mine, you may need to replace this with #define USE_FIXED_UNIQUE_MODE 0 to disable the new 'unique' patch. Unfortunately, I haven't tested this code on MSVC/Win32, because that would simply cause too much fuss at other spots in the code, but MSVC2005 does recognize this type of mistake. My bad luck, and another hint why portable code QC should always use products from at least two different compiler vendors.

Typo in comment

@@ -140,7 +154,7 @@
 //    Option B3 - Use shared bit allocations, so classes use shared
 //      bit patterns.  The bad news is that this generates phantom classes
 // 
-//    Option C1 - use MySQL to do the storage.  Easy to implennt
+//    Option C1 - use MySQL to do the storage.  Easy to implement
 //
 //    *****************************************************************
 //

'Nuff said.

And here's the icepick in the forehead (Zappa) - again: HEX 0x021 revisited

@@ -318,10 +332,27 @@   //             grab the filename, and stat the file
   //      note that neither "stat", "fopen", nor "open" are
   //      fully 8-bit or wchar clean...
+#if 0
   i = 0;
-  while (htext[i] < 0x021) i++;
+    while (htext[i] < 0x021)
+        i++;
+//    CRM_ASSERT(i < hlen);
   j = i;
-  while (htext[j] >= 0x021) j++;
+    while (htext[j] >= 0x021)
+        j++;
+//    CRM_ASSERT(j <= hlen);
+#else
+ if (!crm_nextword(htext, hlen, 0, &i, &j) || j == 0)
+ {
+            fev = nonfatalerror(
+               "\nYou didn't specify a valid filename: ", 
+                   htext); 
+            return fev;
+ }
+ j += i;
+//    CRM_ASSERT(i < hlen);
+//    CRM_ASSERT(j <= hlen);
+#endif
 
   //             filename starts at i,  ends at j. null terminate it.
   htext[j] = '\000';

Identical in approach, cause and effect to the fix for the neural net shown above.

alloc(size) vs alloc(size + 1) - not a guestimate but a Sure Thing

@@ -330,7 +361,7 @@   //   Note that during a LEARN in hyperspace, we do NOT use the mmap of
   //    pre-existing memory.  We just write to the end of the file instead.
   //    malloc up the unsorted hashbucket space  
-  hashes = calloc (HYPERSPACE_MAX_FEATURE_COUNT, 
+    hashes = calloc(HYPERSPACE_MAX_FEATURE_COUNT + 1 /* still space for the sentinel at worst case! */, 
           sizeof (HYPERSPACE_FEATUREBUCKET_STRUCT));
   hashcounts = 0;
   //  put in a zero as the start marker.

As I understand it, Hyperspace stores series of hashes to file, where each series is terminated by a NUL sentinel. When the hash series produced by the VT actually reaches the maximum allowed length of HYPERSPACE_MAX_FEATURE_COUNT, there should still be valid memory available to store the subsequent NUL sentinel hash. Hence the need to allocate space for HYPERSPACE_MAX_FEATURE_COUNT+1 hashes, while only HYPERSPACE_MAX_FEATURE_COUNT of them get filled by the VT.
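The allocation pattern, boiled down (toy sizes and element type of my own; not the real HYPERSPACE_FEATUREBUCKET_STRUCT):

#include <stdint.h>
#include <stdlib.h>

#define MAX_FEATURES 500000    /* stand-in for HYPERSPACE_MAX_FEATURE_COUNT */

int main(void)
{
    /* +1: even a maximally filled series still has room for the
       terminating NUL sentinel hash */
    uint32_t *hashes = calloc(MAX_FEATURES + 1, sizeof(uint32_t));
    long hashcounts = MAX_FEATURES;     /* worst case: the VT fills it all */

    if (hashes == NULL)
        return 1;
    hashes[hashcounts] = 0;             /* the sentinel always fits */
    free(hashes);
    return 0;
}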

See below for other related items: the Hyperspace fixes assume the above description is correct. Because if it is, the current HS code does not seem to write the NUL sentinel to disk (though the classifier looks for it in 'classify' mode :-S ) and yet seems to work. Somehow.

Stay tuned for more.

Unused code and a comment fix

@@ -478,14 +509,14 @@      {
        unsigned long h1;
        unsigned long h2;
-       long th = 0;         // a counter used for TSS tokenizing
+            // int th = 0;               // a counter used for TSS tokenizing
        long j;
        //
        //     old Hash polynomial: h0 + 3h1 + 5h2 +11h3 +23h4
        //     (coefficients chosen by requiring superincreasing,
        //     as well as prime)
        //
-       th = 0;
+            // th = 0;
        //
        if (use_unigram_features == 1)
          {
@@ -532,7 +563,7 @@
#ifdef NotInAMillionYears
 
-   First, get any optional
+    //   First, get any optional
   //   tokenizer pipeline setups (defined by the keyword "pipeline",
   //   followed by the number of pipeline vectors, followed by the length
   //   of the pipeline vectors, followed by the pipeline weight (must

The USE_FIXED_UNIQUE_MODE unique mode fix - part 1

@@ -646,6 +677,49 @@
   //   And uniqueify the hashes array
   //
   
+#if USE_FIXED_UNIQUE_MODE
+
+    if (internal_trace)
+   {
+        fprintf(stderr, "Pre-Unique: %ld as:\n", hashcounts);
+       for (i = 0; i < hashcounts; i++)
+       {
+           fprintf(stderr, "hash[%6ld] = %08lx\n", i, (unsigned long int)hashes[i].hash);
+       }
+   }
+  
+  if (unique)
+    {
+    i = 1;
+    j = 1;
+      while ( i < hashcounts )
+      {
+   if (hashes[i].hash != hashes[i-1].hash
+       //      || hashes[i].key != hashes[i-1].key )
+       )
+     {
+       hashes[j]= hashes[i-1];
+       j++;
+     };
+   i++;
+      };
+      hashcounts = j;
+    };
+
+  //    Put in a sentinel zero.
+  hashes[hashcounts].hash = 0;
+
+  //Debug print
+
+    if (internal_trace)
+   {
+        fprintf(stderr, "Post-Unique: %ld as:\n", hashcounts);
+       for (i = 0; i < hashcounts; i++)
+       {
+           fprintf(stderr, "hash[%6ld] = %08lx\n", i, (unsigned long int)hashes[i].hash);
+       }
+   }
+#else
   i = 0;
   j = 0;
 

No, it's not enough to set i and j to zero at the start of the loop or some such.

There's so much wrong with the old code that it's hard to know where to begin.

Anyhow, it's much easier to look back while scanning for dupes: assuming the first element is unique anyhow (it is, for this algorithm), any subsequently visited element in the original list (i.e. index range 1..hashcounts-1, all inclusive) can simply be compared with its predecessor: skip it when it matches, shift/copy it when it differs, i.e. when it is 'unique'.

Use j to track the shift/fill position as before, so it'll neatly be the new hashcounts when we're done.

Please don't go fancy on me by decrementing hashcounts after that, so that we lose another unique hash in there - because the rest of the code treats 'hashcounts' as the total length of the hash list anyhow. Ahem, 'fortunately', this last little bit of voodoo kludging only happened at the second instance. Boy, what a mess. She's totally soaked.
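For reference, here's the look-back dedup in isolation - my own condensed variant on plain uint32_t hashes (keeping each element that differs from its predecessor) rather than the feature bucket structs:

#include <stdint.h>

static long unique_sorted(uint32_t *hashes, long hashcounts)
{
    long i, j;

    if (hashcounts <= 1)
        return hashcounts;              /* nothing to dedup */
    for (i = 1, j = 1; i < hashcounts; i++)
    {
        if (hashes[i] != hashes[i - 1]) /* differs from its predecessor? */
            hashes[j++] = hashes[i];    /* then keep it: shift/copy down */
    }
    return j;                           /* the new hashcounts */
}

Call it as hashcounts = unique_sorted(hashes, hashcounts); and put the sentinel at hashes[hashcounts] afterwards, just like the patch does.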

@@ -693,6 +767,8 @@
        hashes[6].hash,
         hashes[7].hash);
 
+#endif
+
   if (user_trace)
     fprintf (stderr, "Unique hashes generated: %ld\n", hashcounts);
   //    store hash count of this document in the first bucket's .key slot

and that was the terminating #endif for this section of USE_FIXED_UNIQUE_MODE.

And now for something completely different: error checking ... but ...

there's also a little bit of USE_FIXED_UNIQUE_MODE stuff in there. Why?

Because I was a bit lazy and unsure whether I had understood the Hyperspace storage format correctly, I used the same #define to temporarily disable this particular fix - which has nothing to do with 'unique' whatsoever - while testing the megatest numbers against Bill's vanilla code tree. Once those matched up, I turned on '#define USE_FIXED_UNIQUE_MODE 1' and all was well. But megatest never got back to the original numbers as listed by Bill after that. ;-)

What happens here is that the fwrite() is instructed to include the NUL sentinel hash, which terminates a 'document', in the output to disk, so that this hash series will be recognized as an individual document by the 'classify' code later on, like it should. megatest does not reveal this flaw, as it tests only one document anyway, so sentinel in the CSS or not does not matter all that much to megatest.

Additionally, the validation code also checks to make sure fwrite() has written the sentinel too. Hence, twice the preproc-selectable '+1' logic below:

@@ -715,13 +791,41 @@
    fprintf (stderr, "Opening hyperspace file %s for append.\n", 
         hashfilename);
       hashf = fopen ( hashfilename , "ab+");
+        if (hashf == 0)
+        {
+            fatalerror("For some reason, I was unable to append-open the file named ",
+                    hashfilename);
+        }
+        else
+        {
+            int ret;
+
       if (user_trace)
    fprintf (stderr, "Writing to hash file %s\n", hashfilename);
+
+            //     And make sure the file pointer is at EOF.
+            (void)fseek(hashf, 0, SEEK_END);
+
       //    and write the sorted hashes out.
-      fwrite (hashes, 1, 
-         sizeof (HYPERSPACE_FEATUREBUCKET_STRUCT) * hashcounts, 
+//         CRM_ASSERT(hashes[hashcounts].hash == 0);
+//         CRM_ASSERT(hashcounts > 0 ? hashes[hashcounts - 1].hash != 0 : TRUE);
+            ret = fwrite(hashes, sizeof(HYPERSPACE_FEATUREBUCKET_STRUCT),
+#if USE_FIXED_UNIQUE_MODE
+               1 +
+#endif
+hashcounts,      /* [i_a] GROT GROT GROT shouldn't this be 'hashcounts+1', just like SVM/SKS? */
          hashf);
+            if (ret != hashcounts
+#if USE_FIXED_UNIQUE_MODE
+               + 1
+#endif
+               )
+            {
+                fatalerror("For some reason, I was unable to append a hash series to the file named ",
+                        hashfilename);
+            }
       fclose (hashf);
+        }
       
       //  let go of the hashes.
       free (hashes);

Can Do Counting

@@ -744,7 +848,9 @@
       long thisstart, thislen, thisend;
       double bestrad;
       long wrapup;
-      double kandu, unotk, knotu, dist, radiance;
+        long kandu;
+        long unotk, knotu;
+        double dist, radiance;
       long k, u;
       long file_hashlens;
       HYPERSPACE_FEATUREBUCKET_STRUCT *file_hashes;

grep the code for their usage: kandu et al. are integer counters. Then make it so. Counting with floating point numbers will not get you the same result: the exact value may be ever so slightly off after many operations, and then floor(), round(), integer casts and other discretization operations run the risk of sometimes-off-by-one glitches. Floating point arithmetic is imprecise by definition. Don't use it if you don't need it, and here one definitely doesn't need it.

There's another case of this a little further down the tract.
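If you want to see the drift for yourself, here's a minimal standalone demo (mine, not from the patch):

/* count 20 million events in a float and in a long, side by side */
#include <stdio.h>

int main(void)
{
    float f = 0.0f;
    long n = 0;
    long i;

    for (i = 0; i < 20000000; i++)
    {
        f += 1.0f;
        n += 1;
    }
    /* a 32-bit float runs out of integer precision at 2^24 = 16777216:
       from there on, f + 1.0f rounds back to f and the counter stalls */
    printf("float says %.1f, long says %ld\n", f, n);  /* 16777216.0 vs 20000000 */
    return 0;
}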

Boundary overruns and core dumps

@@ -797,8 +903,8  
@@
      if (internal_trace)
        fprintf (stderr, 
           "At featstart, looking at %ld (next bucket value is %ld)\n",
-          file_hashes[thisstart].hash,
-          file_hashes[thisstart+1].hash);
+                        (unsigned long int)file_hashes[thisstart].hash,
+                       (thisstart + 1 < file_hashlens ? (unsigned long int)file_hashes[thisstart + 1].hash : 0));
      while (wrapup == 0)
          {
        //    it's an in-class feature.

Accessing file_hashes[thisstart+1] like this, when you're sitting at the boundary of the [allocated] memory, is a heartache waiting to happen. The code is sanitized by additional boundary checks where applicable.

Note: there may be several other spots throughout the code where display or other code lines can cause core dumps due to out-of-bounds memory accesses like these. I haven't checked rigorously for that. Mea culpa.
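The guard pattern itself, reduced to its essence (a sketch; thisstart, file_hashes and file_hashlens as in the code, next_val is mine):

/* never dereference [index + 1] without proving it's inside the range */
unsigned long next_val = 0;    /* 0 doubles as "no next bucket" */
if (thisstart + 1 < file_hashlens)
    next_val = (unsigned long)file_hashes[thisstart + 1].hash;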

Anyway, here's another one of those ...

@@ -861,14 +967,18  
@@
      if (internal_trace) 
        fprintf (stderr, 
             "At featend, looking at %ld (next bucket value is %ld)\n",
-            file_hashes[thisend].hash,
-            file_hashes[thisend+1].hash);
+                        (unsigned long int)file_hashes[thisend].hash,
+                       (thisend + 1 < file_hashlens ? (unsigned long int)file_hashes[thisend + 1].hash : 0));
 
      //  end of a document- process accumulations
      
      //    Proper pythagorean (Euclidean) distance - best in
      //   SpamConf 2006 paper
+#if 10 && defined (GER)
+            dist = sqrt(unotk + knotu);
+#else
      dist = sqrtf (unotk + knotu) ;
+#endif
      
      // PREV RELEASE VER --> radiance = 1.0 / ((dist * dist )+ 1.0);
      //

The #if 10 is in there because calculations at 'double' precision produce more accurate results at almost no cost. When you take extra special care of your code, 'float' can be faster, even on modern FPUs, but that does require some additional coding effort.
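A standalone illustration of what the single-precision detour can cost (mine, not the patch's):

/* sqrtf() pushes the operand through float, losing low-order bits;
   sqrt() keeps the whole calculation at double precision */
#include <stdio.h>
#include <math.h>

int main(void)
{
    long unotk = 16777217;    /* 2^24 + 1: not representable in float */
    long knotu = 0;
    printf("sqrtf: %.10f\n", (double)sqrtf(unotk + knotu));   /* 4096.0000000000 */
    printf("sqrt : %.10f\n", sqrt((double)(unotk + knotu)));  /* 4096.0001220703 */
    return 0;
}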

Comment typo. duh

@@ -927,7 +1037,7  
@@
}
 
 
-//      How to do a Osb_Bayes CLASSIFY some text.
+//      How to do a Osb_Hyperspace CLASSIFY some text.
 //
 int crm_expr_osb_hyperspace_classify (CSL_CELL *csl, ARGPARSE_BLOCK *apb,
                 char *txtptr, long txtstart, long txtlen)

Can Do Counting - The Sequel

See above for the first movie.

@@ -999,9 +1109,9  
@@
   //     Basic match parameters
   //     These are computed intra-document, other stuff is only done
   //     at the end of the document.  
-  float knotu;   // features in known doc, not in unknown
-  float unotk;   // features in unknown doc, not in known
-  float kandu;   // feature in both known and unknown
+    long knotu; // features in known doc, not in unknown
+    long unotk; // features in unknown doc, not in known
+    long kandu; // feature in both known and unknown
   
   //     Distance is the pythagorean distance (sqrt) between the
   //     unknown and a known-class text; we choose closest.  (this

alloc(size) vs alloc(size + 1) - not a guestimate but a Sure Thing - part 2 + CRM_NOCASE

Recall the alloc(size+1) issue above, where the extra slot enables the code to legally store the sentinel under worst case conditions? Here's the second instance, serving that same purpose.

You may safely assume (grep the code for the variable's usage to make sure) that the unk_hashcount++; statement is bogus and its effect unused.
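The essence of the fix, spelled out (a sketch; N stands in for HYPERSPACE_MAX_FEATURE_COUNT):

/* worst case: every one of the N feature slots gets filled, and the
   sentinel terminating the document still needs a legal slot of its own */
HYPERSPACE_FEATUREBUCKET_STRUCT *b =
    calloc(N + 1, sizeof(HYPERSPACE_FEATUREBUCKET_STRUCT));
/* ... tokenize; count may legally reach N ... */
b[count].hash = 0;   /* with only N slots, this write would land one
                        past the end of the allocation when count == N */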

@@ -1065,10 +1175,10  
@@
     fprintf (stderr, "executing a CLASSIFY\n");
   
   //        make the space for the unknown text's hashes
-  unk_hashes = calloc (HYPERSPACE_MAX_FEATURE_COUNT, 
+    unk_hashes = calloc(HYPERSPACE_MAX_FEATURE_COUNT + 1 /* still space for sentinel at worst case! */,
           sizeof (HYPERSPACE_FEATUREBUCKET_STRUCT));
   unk_hashcount = 0;
-  unk_hashcount++;
+    // unk_hashcount++;
 
   //           extract the variable name (if present)
   //    (we now get those fromt he caller)
@@ -1112,7 +1222,9   if (apb->sflags & CRM_NOCASE)
     {
-      cflags = REG_ICASE | cflags;
+        if (user_trace)
+            fprintf(stderr, " setting NOCASE for tokenization\n");
+        cflags |= REG_ICASE;
       eflags = 1;
     };
 

Ah, almost forgot: the added user_trace, for completeness' sake: the other flags get honorable mention in -t mode, so why not NOCASE? Besides, the '|=' operator is a nice and readable shorthand.

Boundary checking is Good, but do it timely

Here is a boundary check which already existed in the old code, but it unfortunately came barging in after the show was already over. First check, then act. (But then, my mentor and CEO for 9 years, Jack van Praag, taught me that people generally forget the check and are rather quick on the act. Wham, bam! Thank you, ma'am! Boy, was/is he right on the money there.)

For the old check that got moved up here, see this same section a little further down.

@@ -1215,7 +1327,15  
@@
        }
          else
        {
-         //  file exists - do the open/process/close      
+                   // [i_a] check hashes[] range BEFORE adding another one!
+            if (maxhash >= MAX_CLASSIFIERS)
+            {
+                nonfatalerror("Too many classifier files.",
+                        "Some may have been disregarded");
+            }
+           else
+           {
+                    //  file exists - do the mmap
          //    
          hashlens[maxhash] = statbuf.st_size;
          //  mmap the hash file into memory so we can bitwhack it

and a bit of that error checking - never out of fashion with the elite (who do not spell themselves l33t):

@@ -1254,9 +1374,15  
@@
              if (user_trace)
                fprintf (stderr, "Read-opening file %s\n", fname);
              hashf= fopen (fname, "rb");
+                            if (hashf == 0)
+                            {
+                                fatalerror("For some reason, I was unable to read-open the Hyperspace file named ",
+                                        fname);
+                            }
              fread (hashes[maxhash], 1, hashlens[maxhash], hashf);
              fclose (hashf);
            };
+                        }
              
              //  set this hashlens to the length in features instead
              //  of the length in bytes.

The error string which mentions 'malloc' now says 'alloc' and, ta!da!, another case of uncrustify cleanup: '\000' is a very elaborate way of saying: 0

Autocleaned by uncrustify. I love it.

And finally, over at the end, the boundary check which was moved up.

@@ -1264,17 +1390,19  
@@
            / sizeof ( HYPERSPACE_FEATUREBUCKET_STRUCT );
              hashname[maxhash] = (char *) malloc (fnlen+10);
              if (!hashname[maxhash])
+                        {            untrappableerror(
-                    "Couldn't malloc hashname[maxhash]\n","We need that part later, so we're stuck.  Sorry.");
+                                    "Couldn't alloc hashname[maxhash]\n", "We need that part later, so we're stuck.  Sorry.");
+                        }
+                        else
+                        {
              strncpy(hashname[maxhash],fname,fnlen);
-             hashname[maxhash][fnlen]='\000';
+                            hashname[maxhash][fnlen] = 0;
+                        }
              maxhash++;
            };
        };
        };
-     if (maxhash > MAX_CLASSIFIERS-1)
-       nonfatalerror5 ("Too many classifier files.",
-               "Some may have been disregarded", CRM_ENGINE_HERE);
    };
     };
 
 

The USE_FIXED_UNIQUE_MODE unique mode fix - part 2

@@ -1492,6 +1620,45  
@@
   //     we can do fast comparisons against each document's hashes in
   //     the hyperspace vector files.
 
+#if USE_FIXED_UNIQUE_MODE
+//    CRM_ASSERT(unk_hashcount >= 0);
+//    CRM_ASSERT(unk_hashcount < HYPERSPACE_MAX_FEATURE_COUNT);
+    //mark the end of a feature vector
+    unk_hashes[unk_hashcount].hash = 0;
+
+
+  qsort (unk_hashes, unk_hashcount, sizeof (HYPERSPACE_FEATUREBUCKET_STRUCT),
+    &hash_compare);
+
+    if (user_trace)
+        fprintf(stderr, "Total hashes in the unknown text: %ld\n", unk_hashcount);
+
+
+    //       uniqueify the hashes array.
+    if (use_unique)
+    {
+//        CRM_ASSERT(unk_hashcount >= 0);
+//        CRM_ASSERT(unk_hashcount < HYPERSPACE_MAX_FEATURE_COUNT);
+//        CRM_ASSERT(unk_hashes[unk_hashcount].hash == 0);
+
+         for (i = j = 1; i < unk_hashcount; i++)
+         {
+             if (unk_hashes[i].hash != unk_hashes[j - 1].hash)
+             {
+                 unk_hashes[j] = unk_hashes[i];
+                 j++;
+             }
+         }
+         unk_hashcount = j;
+
+        //mark the end of a feature vector
+        unk_hashes[unk_hashcount].hash = 0;
+    }
+
+//    CRM_ASSERT(unk_hashcount >= 0);
+//    CRM_ASSERT(unk_hashcount < HYPERSPACE_MAX_FEATURE_COUNT);
+//    CRM_ASSERT(unk_hashes[unk_hashcount].hash == 0);
+#else
   qsort (unk_hashes, unk_hashcount, sizeof (HYPERSPACE_FEATUREBUCKET_STRUCT),
     &hash_compare);
 
@@ -1518,9 +1685,12  

... and some fancy footwork in closing

This one is only marked by a comment line, but it's a bug for sure.

However, if you ever want to get your megatest numbers matching Bill's, better keep this in while #define USE_FIXED_UNIQUE_MODE 0 / #undef USE_FIXED_UNIQUE_MODE is active. When you switch to the new code using #define USE_FIXED_UNIQUE_MODE 1, this will be discarded anyhow, so no sweat!

@@      i++;
    };
       j--;
+    // [i_a] GROT GROT GROT: with the next line, if 'unique' was specified, totalfeatures will end up being the total number of features MINUS ONE!
       unk_hashcount = j;
     }
 
+#endif
+
   if (user_trace)
     fprintf (stderr, "unique hashes generated: %ld\n", unk_hashcount);
 
@@ -1557,7 +1727,8 #ifdef KNN_ON
     //      Initialize the KNN neighborhood.
-    for (i = 0; i < KNN_NEIGHBORHOOD_SIZE; i++);
+        for (i = 0; i < KNN_NEIGHBORHOOD_SIZE; i++)
+            ;
     {
       top_n_val[i] = 0.0;
       top_n_class[i] = -1;

Ah, a little reformatting of an 'empty' for() loop there: putting the stray semicolon on its own line at least makes the empty body explicit - and makes it plain to see that the braced block following it executes just once (with i already run out to KNN_NEIGHBORHOOD_SIZE, no less).

svlen and the pR result

Now this is an interesting one: in fringe cases, where svlen == 0, you will not see a proper pR being calculated. Which causes 'classify' to behave differently than when used under other circumstances.

If you don't see it, track the assignments to the variable tprob throughout the source file and see where it is used and for what purposes.

Hence my fix to make sure the pR calculation is done at all times ... even when the result is not stored in a CRM script variable.

@@ -1962,7 +2133,11  
@@
   tprob = 0.0;
   for (k = 0; k < succhash; k++)
     tprob = tprob + ptc[k];
-  if (svlen > 0)
+    //
+    //      Do the calculations and format some output, which we may or may
+    //      not use... but we need the calculated result anyway.
+    //
+    if (1 /* svlen > 0 */)
     {
       char buf[1024];
       double accumulator;
@@ -1994,7 +2169,7  

More trailing zeroes don't make it more precise

0.5000 doesn't decode into a more accurate floating point representation of said number. uncrustify agrees - this is another of those autocleanups demonstrated.

@@       //  There would be a possible buffer overflow except that _we_ control
       //   what gets written here.  So it's no biggie.
 
-      if (tprob > 0.5000)
+        if (tprob > 0.5)
    {
      sprintf (buf, "CLASSIFY succeeds; success probability: %6.4f  pR: %6.4f\n", tprob, overall_pR );
    }
@@ -2013,7 +2188,15  

sprintf() considered harmful - use snprintf() instead

Ever printf()ed %s content which caused a buffer overflow in your statically or dynamically allocated memory space? No? Strange; multitudes of web servers and others have been subject to this and other 'buffer overflow' attacks.

Of course, for maximum kudos, make sure you're aware snprintf() needs the buffer size as a 'maximum number of elements', which is not the same as the number of bytes reported by sizeof() when you use anything other than the basic 1970's 'char' type. Imagine crm114 is ever made Unicode (or wchar_t) aware: when's your sizeof() going to save you then? Hence use the 'number of elements' custom macro WIDTHOF() - of course available in the full GerH distro crm114_sysinclude.h portability header file. <plug />
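To make the element-count vs. byte-count distinction concrete, a standalone sketch (using the same WIDTHOF() definition you'll find further down in crm_vector_tokenize.c):

#include <stdio.h>
#include <wchar.h>

#define WIDTHOF(item)  (sizeof(item) / sizeof((item)[0]))

int main(void)
{
    wchar_t buf[256];
    /* sizeof(buf) is 512 or 1024 *bytes*, depending on the platform's
       wchar_t; WIDTHOF(buf) is always the 256 *elements* that
       swprintf() wants as its size argument */
    printf("bytes: %u, elements: %u\n",
           (unsigned)sizeof(buf), (unsigned)WIDTHOF(buf));
    swprintf(buf, WIDTHOF(buf), L"%d wide chars fit here", 256);
    return 0;
}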

@@      {
        remainder = remainder + ptc[m];
      };
-      sprintf (buf, "Best match to file #%ld (%s) "\
+        /* CRM_ASSERT(bestseen < maxhash); ** [i_a] this one was triggered in a zero file error condition
+         *                                    due to the nonfatalness of the maxhash==0 error report. */
+
+        //   ... and format some output of best single matching file
+        //
+        buf[0] = 0;
+        if (bestseen < maxhash)
+        {
+            snprintf(buf, sizeof(buf), "Best match to file #%ld (%s) " // GROT GROT GROT should be WIDTHOF(buf) there but that'll have to wait till those macros are introduced :-(
                "prob: %6.4f  pR: %6.4f  \n",
           bestseen,
           hashname[bestseen],
@@ -2025,6 +2208,8  

and snprintf() isn't the Grail either: quite a few systems out there (some even documented) do not write the NUL character as C-string sentinel when snprintf() runs out of space due to specified buffer limitations. So better make sure and postcede every snprintf() call with a line like the one below, where the NUL character sentinel is written in the very last character slot in the snprintf() destination buffer.

@@           //    rescaled yet again for pR from 1500 to 150
           // 25 * (log10(ptc[bestseen]) - log10(remainder))
           );
+            buf[sizeof /* WIDTHOF */ (buf) - 1] = 0;
+        }
       if (strlen (stext) + strlen(buf) <= stext_maxlen)
    strcat (stext, buf);
       sprintf (buf, "Total features in input file: %ld\n", totalfeatures); 
@@ -2039,8 +2224,10  

And another snprintf():

@@        {
          remainder = remainder + ptc[m];
        };
-     sprintf (buf, 
-          "#%ld (%s):"\
+//            CRM_ASSERT(k >= 0);
+//            CRM_ASSERT(k < maxhash);
+            snprintf(buf, sizeof /* WIDTHOF */ (buf),
+          "#%ld (%s):"
           " features: %ld, hits: %ld, radiance: %3.2e, prob: %3.2e, pR: %6.2f \n", 
           k,
           hashname[k],
@@ -2049,6 +2236,7  

with postceding NUL sentinel write (as above):

@@           class_radiance[k],
           ptc[k], 
           10 * (log10 (ptc[k]) - log10 (remainder) )  );
+            buf[sizeof /* WIDTHOF */ (buf) - 1] = 0;
      //  Rescaled for +/- 10 pR units optimal thick threshold
      //250 * (log10 (ptc[k]) - log10 (remainder) )  );
      //    rescaled yet again for pR from 1500 to 150
@@ -2068,9 +2256,12  
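By the way, if you tire of typing that postceding write, the whole dance wraps up nicely in a little helper - a sketch of mine, not part of the patch:

#include <stdarg.h>
#include <stdio.h>

/* snprintf() plus a guaranteed NUL sentinel, whatever the local
   libc decides to do when the output is truncated */
static int safe_snprintf(char *dst, size_t dstsize, const char *fmt, ...)
{
    va_list args;
    int ret;

    va_start(args, fmt);
    ret = vsnprintf(dst, dstsize, fmt, args);
    va_end(args);
    if (dstsize > 0)
        dst[dstsize - 1] = 0;    /* the postceded sentinel write */
    return ret;
}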

Since the if (svlen > 0) condition was bypassed in the fix above, it's only proper to check it here, before we go off and attempt to patch the result into a CRM script target variable.

@@             "values for MAX_CLASSIFIERS or MAX_FILE_NAME_LEN?",
             " ", CRM_ENGINE_HERE);
    };
+        if (svlen > 0)
+        {
       crm_destructive_alter_nvariable (svrbl, svlen, 
                       stext, strlen (stext));
     };
+    }
   
 
   //  cleanup time!
@@ -2135,3 +2326,27  

And with this we're at the end of our Hyperspace travels: the stub

A code section similar to the one provided with the Neural Net (and with ALL classifiers in the GerH distro, where you can disable/remove any set of classifiers at compile time): this provides a nice, script-level error reporting stub for when this classifier is not to be included in the build:

@@  regcomp_failed:
   return (0);
 };
+
+#else /* CRM_WITHOUT_OSB_HYPERSPACE */
+
+int crm_expr_osb_hyperspace_learn(CSL_CELL *csl, ARGPARSE_BLOCK *apb,
+        char *txtptr, int txtstart, int txtlen)
+{
+    return nonfatalerror_ex(SRC_LOC(),
+            "ERROR: the %s classifier has not been incorporated in this CRM114 build.\n"
+            "You may want to run 'crm -v' to see which classifiers are available.\n",
+            "OSB-Hyperspace");
+}
+
+
+int crm_expr_osb_hyperspace_classify(CSL_CELL *csl, ARGPARSE_BLOCK *apb,
+        char *txtptr, int txtstart, int txtlen)
+{
+    return nonfatalerror_ex(SRC_LOC(),
+            "ERROR: the %s classifier has not been incorporated in this CRM114 build.\n"
+            "You may want to run 'crm -v' to see which classifiers are available.\n",
+            "OSB-Hyperspace");
+}
+
+#endif /* CRM_WITHOUT_OSB_HYPERSPACE */
+

Some last minute TRE interface fixups [crmregex_tre.c]

diff -u -EbwB --strip-trailing-cr src/crm114.sourceforge.net/src/crmregex_tre.c src/patched/crm114.sourceforge.net/src/crmregex_tre.c
--- src/crm114.sourceforge.net/src/crmregex_tre.c   2008-03-26 02:50:00.000000000 +0100
+++ src/patched/crm114.sourceforge.net/src/crmregex_tre.c   2008-03-26 20:26:56.000000000 +0100

Just another way to write the same thing - #if defined(xxx) is more to type, but I generally use it as a coding standard: no trouble when you later add another check like '&& defined(YYY)'; copy&paste won't get you in trouble then.

More a coding standard sort of thing, this is.

@@ -90,7 +90,7  
@@     if (internal_trace) fprintf (stderr, "Checking the regex cache for %s\n",
                 regex);
     j = 0;
-#ifdef REGEX_CACHE_LINEAR_SEARCH
+#if defined (REGEX_CACHE_LINEAR_SEARCH)
     //
     //          Linear Search uses a strict LRU algorithm to cache
     //          the precompiled regexes.
@@ -111,11 +111,10  

If the change/fix provided by Paolo ~ 2008/03/26 is not to be, it's at least mandatory to apply this (previously reported) fix instead: assume you've got a match with the first entry in the cache (i == 0), while 'found_it' is used as a 0/1 boolean to signal success, and you'll quickly see why this assignment has to be '1' (ONE) instead of 'i' (AIE).

Sometimes, the choice of display font is important when you look at this thing when tired. ;-)
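Spelled out (a sketch with hypothetical helper names, not the actual cache code):

/* 'found_it' is tested as a 0/1 boolean after the scan */
found_it = 0;
for (i = 0; i < cache_size; i++)
{
    if (regex_matches_slot(i))    /* hypothetical predicate */
    {
        /* found_it = i;    BUG: a hit in slot 0 stores 0 ...      */
        found_it = 1;      /* ... so store ONE: "yes, we found it" */
    }
}
if (!found_it)
    compile_and_cache();    /* hypothetical: with the old code, slot-0
                               hits would needlessly recompile here */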

@@        rlen_temp   = regex_len;
        cflags_temp = cflags;
        status_temp = regex_cache[i].status;
-       found_it = i;
+       found_it = 1;
      };
       };
-#endif
-#ifdef REGEX_CACHE_RANDOM_ACCESS
+#elif defined (REGEX_CACHE_RANDOM_ACCESS)
     //
     //             Random Access uses an associative cache based on 
     //             the hash of the regex (mod the size of the cache).
@@ -133,8 +132,10  

The #endif / #ifdef is replaced by a more appropriate #elif: you don't want to include both code sections, even when some moron screws up the config header file and defines both RANDOM and LINEAR for kicks.

And before you run, here's the second instance of that AIE vs ONE fixup for cache hit detection and processing:

@@    rlen_temp   = regex_len;
    cflags_temp = cflags;
    status_temp = regex_cache[i].status;
-   found_it = i;
+   found_it = 1;
       };
+#else
+#error "Must have #define'd ONE of these: REGEX_CACHE_RANDOM_ACCESS, REGEX_CACHE_LINEAR_SEARCH"
 #endif
     
     //    note that on exit, i now is the index where we EITHER found
@@ -170,7 +171,7  

Coding standard stuff:

@@     //     at index i, 
 
 
-#ifdef REGEX_CACHE_LINEAR_SEARCH
+#if defined (REGEX_CACHE_LINEAR_SEARCH)
     //   If we're in linear search, we move 0 through i-1 down to 1
     //   through i and then we stuff the _temp vars into the [i] cache
     //   area.  Note that if it was the final slot (at
@@ -213,9 +214,7  

and the same, though here it's a bit more than just 'either way': use of #elif also implies we, the coders, do know what we're trying to achieve here: either/or. Not both.

@@     regex_cache[0].regex_len = rlen_temp;
     regex_cache[0].status    = status_temp;
     regex_cache[0].cflags    = cflags_temp;
-#endif
-
-#ifdef REGEX_CACHE_RANDOM_ACCESS
+#elif defined (REGEX_CACHE_RANDOM_ACCESS)
     //
     //      In a random access system, we just overwrite the single
     //      slot that we expected our regex to be in...
@@ -241,6 +240,8  

And an #error report when someone triggers a condition which you didn't foresee.

("this can't happen" ... right... don't need interns for that to be proven a falsehood. Give me a bunch of 'professionals' any time and you're in the crapper there, debugging someone else's builds which miraculously don't do what you want them to.)

@@     regex_cache[i].status    = status_temp;
     regex_cache[i].cflags    = cflags_temp;
 
+#else
+#error "Must have #define'd ONE of these: REGEX_CACHE_RANDOM_ACCESS, REGEX_CACHE_LINEAR_SEARCH"
 #endif
 
     //  Just about done.  Set up the return preg..
@@ -278,11 +279,19  

scanf() checking revisited

@@       regaparams_t pblock;
       mblock.nmatch = nmatch;
       mblock.pmatch = pmatch;
-      sscanf (aux_string, "%d %d %d %d", 
+        pblock.cost_subst = 0;
+        pblock.cost_ins = 0;
+        pblock.max_cost = 0;
+        pblock.cost_del = 0;
+        if (4 != sscanf(aux_string, "%d %d %d %d",
          &pblock.cost_subst,
          &pblock.cost_ins,
          &pblock.max_cost,
-         &pblock.cost_del);
+                &pblock.cost_del))
+        {
+            fatalerror("Failed to decode 4 numeric cost parameters for approximate matching ",
+                aux_string);
+        }
       if (user_trace)
    fprintf (stderr,
     "Using approximate match.  Costs: Subst %d Ins %d Max %d Del %d \n",
@@ -314,11 +323,11  

See above for a similar case.

void functions don't return values

This code line is inactive, unless one turns off the regex cache altogether. When they do, they'll be treated to a compiler error.

And about the second one: C functions should list 'void' as their argument list if they've got nothing to get from anybody.

@@   //  till and unless we decache, so crm_regfree is a noop.
   return;
 #else
-   return (regfree (preg));
+    regfree(preg);
 #endif
 };
 
-char * crm_regversion ()
+char *crm_regversion(void)
 {
   static char vs[129];
   strcat (vs, (char *) tre_version ());

Section X: [crm_vector_tokenize.c]

diff -u -EbwB --strip-trailing-cr src/crm114.sourceforge.net/src/crm_vector_tokenize.c src/patched/crm114.sourceforge.net/src/crm_vector_tokenize.c
--- src/crm114.sourceforge.net/src/crm_vector_tokenize.c    2008-03-26 02:51:36.000000000 +0100
+++ src/patched/crm114.sourceforge.net/src/crm_vector_tokenize.c    2008-03-26 20:26:56.000000000 +0100
@@ -34,6 +34,18  

Here's your elusive custom WIDTHOF() macro - only as an example and for ease of use inside this code; it should live in crm114_sysincludes.h however - like it does in the full GerH distro.

@@ extern char *outbuf;
 extern char *tempbuf;
 
+
+
+
+#ifndef WIDTHOF
+/* like sizeof() but this returns the number of /elements/ instead of bytes: */
+#define WIDTHOF(item)                  (sizeof(item) / sizeof((item)[0]))
+#endif
+
+
+
+
+
 ///////////////////////////////////////////////////////////////////////////
 //
 //    This code section (from this comment block to the one declaring
@@ -408,9 +420,6  

Locality of declaration

One does not have to declare all variables at the start of the function. Better to declare them in the smallest/deepest scope block, so their existence cannot be abused by code in other spots assuming they are still valid there.

@@   char *my_regex;
   long my_regex_len;
 
-  char s1text[MAX_PATTERN];
-  long s1len;
-
 
  // For slash-embedded pipeline definitions.
   long ca[UNIFIED_WINDOW_LEN * UNIFIED_VECTOR_LIMIT]; 
@@ -425,10 +434,10  

here's the matching part from further down the diff:

@@ -501,6 +511,9  
@@
   //    override this if a regex was passed in)
   if (! regex)
     {
+       int s1len;
+    char s1text[MAX_PATTERN]; 
+
       crm_get_pgm_arg (s1text, MAX_PATTERN, apb->s1start, apb->s1len);
       s1len = apb->s1len;
       s1len = crm_nexpandvar (s1text, s1len, MAX_PATTERN);

Why WIDTHOF() is useful

And here's WIDTHOF() in use. Nothing changes for now, until you decide to change those constant tables' contents: the WIDTHOF() arithmetic here ensures you only have to adapt one piece of code: the tables themselves. Used like this, WIDTHOF() saves you from having to code-inspect and scan all the source(s) for any related constants.

@@   featurebits = 32;
   hash_vec0 = osb1_coeff;
   hash_len0 = OSB_BAYES_WINDOW_LEN;    // was 5
-  hash_iters0 = 4; // should be 4
+  hash_iters0 = WIDTHOF(osb1_coeff) / OSB_BAYES_WINDOW_LEN; // should be 4
   hash_vec1 = osb2_coeff;
   hash_len1 = OSB_BAYES_WINDOW_LEN;     // was 5
-  hash_iters1 = 4; // should be 4
+  hash_iters1 = WIDTHOF(osb2_coeff) / OSB_BAYES_WINDOW_LEN; // should be 4
   output_stride = 1;
 
   //    put in the passed-in regex values, if any.

and some more:

@@ -443,7 +452,8  
@@
     {
       hash_vec0 = markov1_coeff;
       hash_vec1 = markov2_coeff;
-      hash_iters0 = hash_iters1 = 16;
+        hash_iters0 = WIDTHOF(markov1_coeff) / OSB_BAYES_WINDOW_LEN /* 16 */;
+        hash_iters1 = WIDTHOF(markov2_coeff) / OSB_BAYES_WINDOW_LEN /* 16 */;
     };
 
   //     If it's one of the 64-bit-key classifiers, then the featurebits

VT: output_stride for non-32bit step size

@@ -632,7 +645,7  
@@
               hash_iters0,
               features,
               featureslen,
-              1,           //  stride 1 for 32-bit
+                output_stride,                      //  stride 1 for 32-bit
               features_out,
               next_offset);
     }

The above is equivalent, but the change mimics the one below, where the value of output_stride != 1: when we are looking at dual-hash systems (h1 + h2), VT should generate both series by interleaving them into the 'features_out' array.

If, however, you always used a step distance of 1, both VT runs below would overwrite each other's results - except for the first and the last hash value produced - and the calling routine would be amiss in assuming a properly interleaved output: instead of h1/h2 pairs it would find one hash for h1, followed by a continuous (step size hardcoded as 1 in the old code!) sequence of h2 hashes.
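Schematically (a sketch of mine; h1_hash/h2_hash are hypothetical stand-ins for the two coefficient sets):

/* slot:   0    1    2    3    4    5   ...
   owner: h1   h2   h1   h2   h1   h2   ...        */
for (i = 0; i < n; i++)
    out[0 + i * 2] = h1_hash(i);    /* pass 1: start 0, stride 2 */
for (i = 0; i < n; i++)
    out[1 + i * 2] = h2_hash(i);    /* pass 2: start 1, stride 2 */
/* with the stride hardcoded to 1, pass 2 would simply overwrite
   out[1..n] right after pass 1 had filled them */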

@@ -650,21 +663,21  
@@
               hash_iters0,
               features,
               featureslen,
-              1,           //  stride 1 for 32-bit
+                output_stride,                      //  stride 2 for 64-bit, offset 0
               features_out,
               next_offset);
       crm_vector_tokenize (
               text,
               textlen,
               start_offset,
-              regex,
-              regexlen,
+                my_regex,
+                my_regex_len,
               hash_vec1,
               hash_len1,
               hash_iters1,
-              &(features[1]),
+                &features[1],
               featureslen,
-              1,           //  stride 1 for 32-bit
+                output_stride,                      //  stride 2 for 64-bit, offset 1 thanks to &features[1]
               features_out,
               next_offset);
     };
@@ -762,7 +775,7  

Besides that, the second call does not use the properly assigned my_regex/_len pair, but the initial function arguments handed in by the caller, which might very well be NULL/0, resulting in another sort of disaster.

@@       regexlen,
       osb1_coeff,
       OSB_BAYES_WINDOW_LEN, 
-      4,  // should be 4
+      WIDTHOF(osb1_coeff) / OSB_BAYES_WINDOW_LEN,  // should be 4
       features,
       featureslen,
       2,
@@ -792,7 +805,7  

15 / 16

All this is of course intermingled with some more WIDTHOF() use, including one place where it counts: string_kern_len should be capped at 16, given the current contents of the relevant constants table. Or am I mistaken in assuming the complete list of constants should be usable here, if one so desires?

@@       regexlen,
       osb2_coeff,
       OSB_BAYES_WINDOW_LEN, 
-      4,  // should be 4
+      WIDTHOF(osb2_coeff) / OSB_BAYES_WINDOW_LEN,  // should be 4
       features,
       featureslen,
       2,
@@ -818,7 +831,8   //    The coeffs should be relatively prime.  Relatively...
 
-  if (string_kern_len > 15) string_kern_len = 15;
+    if (string_kern_len > WIDTHOF(string1_coeff))
+        string_kern_len = WIDTHOF(string1_coeff);
 
   return crm_vector_tokenize
     ( text,
@@ -851,7 +865,8   //    The coeffs should be relatively prime.  Relatively...
 
-  if (string_kern_len > 15) string_kern_len = 15;
+    if (string_kern_len > WIDTHOF(string2_coeff))
+        string_kern_len = WIDTHOF(string2_coeff);
 
   return crm_vector_tokenize
     ( text,

Section Last: the differences in megatest results

These are left as an exercise for the reader.

diff -u -EbwB --strip-trailing-cr src/crm114.sourceforge.net/src/megatest_test.log src/patched/crm114.sourceforge.net/src/megatest_test.log
--- src/crm114.sourceforge.net/src/megatest_test.log    2008-03-27 03:25:15.000000000 +0100
+++ src/patched/crm114.sourceforge.net/src/megatest_test.log    2008-03-27 03:23:52.000000000 +0100
@@ -205,8 +205,8  
@@  ----- keeping a process around ----  
 
  preparing... 
- OK_IF_PID_CHANGES: one... MINION PROC PID: 18153 from-pipe: 6 to-pipe: 5
- OK_IF_PID_SAME_AS_ABOVE: again... MINION PROC PID: 18153 from-pipe: 6 to-pipe: 5
+ OK_IF_PID_CHANGES: one... MINION PROC PID: 17551 from-pipe: 6 to-pipe: 5
+ OK_IF_PID_SAME_AS_ABOVE: again... MINION PROC PID: 17551 from-pipe: 6 to-pipe: 5
  and done ...DEAD MINION, EXIT CODE: 0 
 exit value test got DEAD MINION, EXIT CODE: 123
  CRM114: Testing union and intersection 
@@ -978,26 +978,26 
 **** Neural Network Classifier 
-./megatest.sh: line 461: 18353 Segmentation fault      ./crm114 '-{learn < neural append > (q_test.css) /[[:graph:]]+/}' <QUICKREF.txt
-./megatest.sh: line 462: 18354 Segmentation fault      ./crm114 '-{learn < neural append > (i_test.css) /[[:graph:]]+/}' <INTRO.txt
-./megatest.sh: line 463: 18355 Segmentation fault      ./crm114 '-{learn < neural refute fromstart > (q_test.css) }' <INTRO.txt
-./megatest.sh: line 464: 18357 Segmentation fault      ./crm114 '-{learn < neural refute fromstart > (i_test.css) }' <QUICKREF.txt
+./megatest.sh: line 461: 17897 Segmentation fault      ./crm114 '-{learn < neural append > (q_test.css) /[[:graph:]]+/}' <QUICKREF.txt
+./megatest.sh: line 462: 17898 Segmentation fault      ./crm114 '-{learn < neural append > (i_test.css) /[[:graph:]]+/}' <INTRO.txt
+./megatest.sh: line 463: 17899 Segmentation fault      ./crm114 '-{learn < neural refute fromstart > (q_test.css) }' <INTRO.txt
+./megatest.sh: line 464: 17900 Segmentation fault      ./crm114 '-{learn < neural refute fromstart > (i_test.css) }' <QUICKREF.txt
  type Q 
 CLASSIFY fails; success probability: 0.500000  pR: 0.0000
 Best match to file #0 (i_test.css) prob: 0.5000  pR: 0.0000  
-Total features in input file: 268
+Total features in input file: 0
 #0 (i_test.css): prob: 5.00e-01, pR:   0.00
 #1 (q_test.css): prob: 5.00e-01, pR:   0.00
 
  type Q 
 CLASSIFY fails; success probability: 0.500000  pR: 0.0000
 Best match to file #0 (i_test.css) prob: 0.5000  pR: 0.0000  
-Total features in input file: 372
+Total features in input file: 0
 #0 (i_test.css): prob: 5.00e-01, pR:   0.00
 #1 (q_test.css): prob: 5.00e-01, pR:   0.00
 
  Alternating Example Neural Network Classifier TRAINING 
-./megatest.sh: line 492: 18362 Segmentation fault      ./crm114 alternating_example_neural.crm
+./megatest.sh: line 492: 17905 Segmentation fault      ./crm114 alternating_example_neural.crm
 
 **** Support Vector Machine (SVM) unigram classifier 
  type I 
@@ -1080,7 +1080,7 
 **** Clump / Pmulc Test 
-../megatest.sh: line 585: 18407 Segmentation fault      ./crm114 '-{ match <fromend> (:one_paragraph:) /([[:graph:]]+.*?\n\n){5}/; clump <bychunk> [:one_paragraph:] (i_test.css) /[[:graph:]]+/; output /./ ; liaf}' <INTRO.txt
+../megatest.sh: line 585: 17965 Segmentation fault      ./crm114 '-{ match <fromend> (:one_paragraph:) /([[:graph:]]+.*?\n\n){5}/; clump <bychunk> [:one_paragraph:] (i_test.css) /[[:graph:]]+/; output /./ ; liaf}' <INTRO.txt
 ..................Likely result: 
 PMULC succeeds; success probabilty: 1.0000 pR: 66.5468
 Best match to clump #4 (clump_#4) prob: 1.0000  pR: 66.5468

Conclusion & Closing statements

Several diffs can be filed under Coding Standards or other quick-to-discard lots.

However, what I've tried to show here is what attempting to provide a consistent look and maintaining a high(?) standard of software development may achieve in terms of code robustness. Note: Code/CSS binary portability is not considered here, nor in these diffs; that would require another set of minor changes, probably matching this document for size. All I will say about it is this: "long" doesn't make stuff portable. Nor does any other single code change / measure. It's all about 'awareness' and that is darn hard to teach.
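One throwaway illustration of that 'long' remark (mine, and only scratching the surface):

/* 'long' is 32 bits on Win64 (LLP64) but 64 bits on most 64-bit
   UNIXes (LP64): a CSS file with raw 'long' fields in it changes
   layout between the two.  Fixed-width types do not: */
#include <stdint.h>

typedef struct
{
    uint32_t hash;    /* same size and layout everywhere */
    uint32_t key;
} PORTABLE_FEATUREBUCKET;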

With that comes the slew of error checking fixes described, which tells you just one thing: you are now on par with much other professional application software out there in terms of code execution stability. (With sufficient sarcasm this may be considered very high praise indeed.)

Training people to understand and stick to a failure-scenario-aware mentality is a hard job and I don't think I did a good job of it this time - too much tongue in cheek, for one. If this software ever gets the good fortune to merit the special care and attention of internet hackers, one can but pray.

Often 'code quality' efforts get bogged down in setting up teams and writing coding standards providing company-wide legislation while enforcing severe implementation vocabulary restrictions - in other words a significantly 'dumbed down' programming vocabulary. It is the natural way (recall "Wham bam, thank you, ma'am!") for people to react to this kind of challenge. I hope I did not fall into that tar pit, and I sure hope nobody else will follow up that way.

Because the only way is up!

And that's more than just a ridiculous song lyric quote: I think I've learned a few things so far (and failed to pick up some other teachings along the way) and I hope this teaches some others something too.

One last stab before I go: I expect this to pass any code vetting, though eyebrows may be raised for quite a while nevertheless.

Do not expect me to write up a simple diff like this so extensively ever again: this is a one-time offer only, because this write up, my friends, eats up time like nothing else. But at least this time, it cuts out quite a bit of the doubt and wonder.

Exit stage left

Now I'll go and listen to some more very nice kwed.org remixes - I might just find a few more that tweak that love and excitement knob in my head. Highly recommended:

Joy!