The CRM114 Quick Reference Card. Updated 20080505

Copyright W.S. Yerazunis, 2002-2008. All rights reserved.
This software is released under V2.1 of the Gnu Public License.
Go to www.fsf.org to get a complete copy of the license.

This is the CRM114 Language Quick Reference. For information on the
mailfilter, see the CRM114_Mailfilter_HOWTO.

----- THE COMMAND LINE -------------

Invoke as 'crm whatever' or use '#!/usr/bin/crm' as the first line of a
script file containing the program text.

Command Line Options:
   -{statements} - execute the statements inside the {} brackets
   -b N   - sets a breakpoint on statement N
   -d N   - run N cycles, then drop into debugger. If no N, debug
            immediately
   -e     - no environment variables imported
   -E N   - set engine runtime exit base value N
   -h     - print help text
   -H N   - select hash function N - handle this with the utmost care!
            (default=0)
   -l N   - print a listing (detail level 1 through 5)
   -m N   - max number of microgroomed buckets in a chain
   -M N   - max chain length - triggers microgrooming if enabled
   -p     - generate an execution-time-spent profile on exit
   -P N   - max program lines @ 128 chars/line
   -q N   - math mode (0 = algebraic, 1 = RPN, in EVAL only;
            2 = algebraic, 3 = RPN, everywhere)
   -r N   - set OSBF min pmax/pmin ratio (default=9)
   -s N   - new sparse spectra feature file (.css) size is N
            (default 1 meg+1 featureslots)
   -S N   - new feature file (.css) size is N rounded up to 2^I+1
            featureslots
   -C     - use environment locale (default POSIX)
   -t     - give user level execution trace output
   -T     - implementers trace output (only for the masochistic!)
   -u dir - chdir to directory dir before starting execution
   -v     - print CRM114 version identification and exit
   -w N   - max data window (bytes, default 16 megs)
   --     - signals the end of CRM114 flags; prior flags are not seen by
            the user program; subsequent args are not processed by CRM114.
   --foo  - creates the user variable :foo: with the value SET
   --x=y  - creates the user variable :x: with the value y
   -in file - use file instead of stdin for input.
            Note that '-in' may specify the standard handle value '0'
            for stdin. Accepted as 'stdin' representatives:
                0 - stdin /dev/stdin CON: /dev/tty
            NOT allowed are the stdout/stderr handle values 1 and 2:
            crm114 will report an error if you use them.
   -out file - use file instead of stdout for output. See '-err' for ways
            to specify stdout/stderr.
   -err file - use file instead of stderr for output.
            Note that '-out' may use the same file as '-err'. Note also
            that '-out' and '-err' may specify the standard handle values
            '1' for stdout and '2' for stderr. This implies that '-err 1'
            is essentially identical to the UNIX shell '2>&1' redirection,
            though without the buffer delays that would otherwise occur
            when you mix stdout and stderr output to a single channel.
            Accepted as default stdio channel (stdout for -out, stderr for
            -err) representatives:
                - CON: /dev/tty
            Accepted as EXPLICIT 'stdout' representatives:
                1 stdout /dev/stdout
            and likewise for explicit stderr:
                stderr /dev/stderr
            NOT allowed is the stdin handle value 0: crm114 will report an
            error if you use it.
   -Cdbg  - direct developer support: trigger the C/IDE debugger when an
            internal error is hit. WARNING: only available when
            'crm114 -v' reports crm114 was built with assertions enabled
            in the code.

Absent the -{ program } flag, the first arg is taken to be the name of a
file containing a crm114 program; subsequent args are merely supplied as
:_argN: values. Use single quotes around command line programs
'-{ like this }' to prevent the shell from doing odd things to your
command-line programs.

CRM114 can be directly invoked by the shell if the first line of your
program file uses the shell standard, as in:

   #! /usr/bin/crm

You can use CRM114 flags on the shell-standard invocation line, and hide
them with '--' from the program itself; '--' incidentally prevents the
invoking user from changing any CRM114 invocation flags.

Flags should be located after any positional variables on the command line.
Flags _are_ visible as :_argN: variables, so you can create your own flags
for your own programs (separate CRM114 and user flags with '--').
Examples:

   ./foo.crm bar mugga < baz -t -w 150000         <--- Use this
   ./foo.crm -t -w 1500000 -- bar < baz mugga     <--- or this
   ./foo.crm -t -w 150000 bar < baz mugga         <--- NOT like this

You can put a list of user-settable vars on the '#!/usr/bin/crm' invocation
line. CRM114 will print these out when a program is invoked directly (e.g.
"./myprog.crm -h", not "crm myprog.crm -h") with the -h (for help) flag.
(Note that this works ONLY on Linux and Darwin; FreeBSD and Solaris have
different implementations, so this doesn't work there. Don't use this in
programs that need to be portable.) Example:

   #!/usr/bin/crm -( var1 var2=A var2=B var2=C )
        - allows only var1 and var2 to be set on the command line. If a
          variable is not assigned a value, the user can set any value
          desired. If the variable is equated to a set of values, those
          are the _only_ values allowed.

   #!/usr/bin/crm -( var1 var2=foo ) --
        - allows var1 to be set to any value; var2 may only be set to
          either "foo" or not at all, and no other variables may be set
          nor may invocation flags be changed (because of the trailing
          "--"). Since "--" also blocks '-h' for help, such programs
          should provide their own help facility.

----- VARIABLES ----------

Variable names and locations start with a : , end with a : , and may
contain only characters that have ink (i.e. the [:graph:] class) with a
few exceptions - basically, no embedded ':' characters. They are case
sensitive. Examples: :here: , :ThErE: , :every-where_0123+45%6789: ,
:this_is_a_very_very_long_var_name_that_does_not_tell_us_much: .
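
The naming rule above can be checked mechanically. The following is a
minimal Python sketch of the rule as worded here, not CRM114's own parser.
is_valid_varname is a hypothetical helper; [:graph:] is approximated as
printable non-space ASCII, a name is assumed to need at least one character
between the colons, and the "few exceptions" mentioned above are not
modeled.

```python
import re

# Assumption: [:graph:] is approximated as ASCII 0x21-0x7E. The colon
# (0x3A) is left out of the inner class so embedded ':' is rejected.
_VARNAME = re.compile(r":[\x21-\x39\x3b-\x7e]+:")


def is_valid_varname(name: str) -> bool:
    """Return True if name looks like :something: with no embedded ':'."""
    return _VARNAME.fullmatch(name) is not None


for candidate in (":here:", ":every-where_0123+45%6789:", ":a:b:", "here"):
    print(candidate, is_valid_varname(candidate))
```

Because the check is case-preserving, :here: and :ThErE: are both accepted
and remain distinct names.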

Builtin variables:

   :_nl:          - newline
   :_ht:          - horizontal tab
   :_bs:          - backspace
   :_sl:          - a slash
   :_sc:          - a semicolon
   :_arg0: thru :_argN: - command-line args, including _all_ flags
   :_argc:        - how many command line arguments there were
   :_pos0: thru :_posN: - positional args ('-' or '--' args deleted)
   :_posc:        - how many positional arguments there were
   :_pos_str:     - all positional arguments concatenated
   :_env_whatever: - environment value 'whatever'
   :_env_string:  - all environmental arguments concatenated
   :_crm_version: - the version of the CRM system
   :_cd:          - the current call depth
   :_cs:          - the current statement number
   :_pgm_hash:    - hash of the current program - for version verification
   :_pgm_text:    - copy of post-processed source code - matchable
   :_pid:         - process ID of the current process.
   :_ppid:        - process ID of the parent of the current process.
   :_dw:          - the current data window contents (usually the default
                    arg)
   :_iso:         - the current isolated data block (change at your own
                    peril!)
   :_?:           - watchable variable: last (trappable) failure report
                    message; ONLY AVAILABLE when the debugger will be used
                    (-d commandline switch / 'debug' statement in script)

---- VARIABLE EXPANSION ----

You can use the standard C char constant '\' characters, such as "\n" for
newline, as well as escaped hexadecimal and octal characters like \xHH and
\oOOO, but these are constants, not variables, and cannot be redefined.

Variables are expanded by the ':*:' var-expansion operator, e.g. :*:_nl:
expands to a newline character. Uninitialized vars evaluate to their text
name (and the colons stay). User variables are also expanded with the :*:
operator, so :*:foo: expands to whatever value :foo: has.

Variables are indirected by the :+: indirection operator; the reason for
the :+: operator is that if :foo: contains the name of another variable
(such as might happen in a CALL statement), then :*: would only return the
name of that other variable, but :+: would return the value in that other
variable.
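
The difference between :*: and :+: can be modeled with a plain lookup
table. This is a Python sketch of the behavior as described above, not
CRM114 code; the table contents and the helper names expand and indirect
are hypothetical.

```python
# Hypothetical variable table; names keep their colons, as in the text.
variables = {
    ":foo:": ":bar:",   # :foo: holds the NAME of another variable
    ":bar:": "hello",
}


def expand(name: str) -> str:
    """Model of :*: - the variable's value, or its own name if unset."""
    return variables.get(name, name)


def indirect(name: str) -> str:
    """Model of :+: - look up the name stored in the variable as well."""
    return expand(expand(name))


print(expand(":foo:"))     # :bar:
print(indirect(":foo:"))   # hello
print(expand(":unset:"))   # :unset:
```

The last line shows the rule that an uninitialized variable evaluates to
its own text name, colons included.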
Use :+: and :*:_cd: to get proper isolation in non-tail-recursive
variables, like :+:foo_:*:_cd:: to get the value of a recursively labeled
foo_0, foo_1, foo_2, etc.

Depending on the value of "math mode" (flag -q), you can also use
:#:string_or_var: to get the length of a string, and :@:string_or_var: to
do basic mathematics and inequality testing, either only in EVALs or for
all var-expanded expressions. See "Sequence of Evaluation" below for more
details.

----- PROGRAM BEHAVIOR ----

Default behavior is to read all of standard input till EOF into the default
data window (named :_dw:), then execute the program (this is overridden if
the first executable statement is a WINDOW statement).

Variables don't get their own storage unless you ISOLATE them (see below);
instead, variables are start/length pairs indexing into the default data
window. Thus, ALTERing an unISOLATEd variable changes the value of the
default data buffer itself. This is a great power, so use it only for good,
and never for evil.

--- STATEMENTS AND STUFF (separate statements with a ';' or with a newline) --

   \        - '\' is the string-text escape character. You only _need_ to
              escape the literal representation of closing delimiters
              inside var-expanded arguments.

              You can use the classic C/C++ \-escapes, such as \n, \r, \t,
              \a, \b, \v, \f, \0, for the ASCII-defined escape sequences,
              and also \xHH and \oOOO for hex and octal characters,
              respectively.

              A '\' as the _last_ character of a line means the next line
              is just a continuation of this one.

              A \-escape that isn't recognized as something special isn't
              an error; you may _optionally_ escape any of these
              delimiters:
                  > ) ] } ; / # \
              and get just that character.

              A '\' anywhere else is just a literal backslash, so the regex
              ([abc])\1 is written just that way; there is no need to
              double-backslash the \1 (although it will work if you do).
              This is because the first backslash escapes the second
              backslash, so only one backslash is "seen" at runtime.
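
Returning to the PROGRAM BEHAVIOR note above: the statement that an
unISOLATEd variable is only a start/length pair into the data window can
be modeled in a few lines. This is a Python sketch under stated
assumptions, not CRM114's implementation; DataWindow, capture, value and
alter are hypothetical names, and the sketch does not re-adjust other
captured variables when an alter changes the text length.

```python
class DataWindow:
    """Toy model: one text buffer plus variables stored as (start, length)."""

    def __init__(self, text: str) -> None:
        self.text = text
        self.vars = {}  # variable name -> (start, length)

    def capture(self, name: str, start: int, length: int) -> None:
        # A captured variable owns no storage; it only indexes the buffer.
        self.vars[name] = (start, length)

    def value(self, name: str) -> str:
        start, length = self.vars[name]
        return self.text[start:start + length]

    def alter(self, name: str, new_value: str) -> None:
        # Altering the variable rewrites the shared buffer in place.
        start, length = self.vars[name]
        self.text = self.text[:start] + new_value + self.text[start + length:]
        self.vars[name] = (start, len(new_value))


dw = DataWindow("hello world")
dw.capture(":w:", 6, 5)
print(dw.value(":w:"))   # world
dw.alter(":w:", "there")
print(dw.text)           # hello there
```

The second print shows why altering a captured variable is visible to
everything else that reads the data window.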

   # this is a comment
   # and this too \#
            - A comment is not a piece of preprocessor sugar - it is a
              -statement- and ends at the newline or at "\#".
              However, comments _can_ be added to lead or trail other
              commands as in-line documentation without the need to use a
              ';' semicolon to separate the statements, e.g.:
                  alius # comment for the alius action here
                  # leading comment \# output /hello/

   insert filename
   insert [expanded_filename]
            - inserts the file verbatim at this line at compile time. If
              the file can't be INSERTed, a system-generated FAULT
              statement is inserted. Use a TRAP to catch this fault if you
              want to allow program execution to continue without the
              missing INSERT file.
     filename - the local (-u applied) file to insert
     [expanded_filename] - the filename is first expanded against
              command-line and environment variables.

   ;        - semicolon is a statement separator - unless it's inside
              delimiters it must be escaped as \; or else it _will_ mark
              the end of the statement.

   { }      - start and end blocks of statements. Must always be '\'
              escaped or inside delimiters or these will mark the
              start/end of a block.

   noop     - no-op statement

   :label:  - define a GOTOable label

   :label: (:arg:)
            - define a CALLable label. The args in the CALL statement are
              concatenated and put into the freshly ISOLATEd var :arg:
     (:arg:) - var-expanded varname to receive the caller's arguments
              (usually a MATCH is then done to put locally convenient
              labels on the args).

   accept   - writes the current data window to standard output;
              execution continues.

   alius    - if the last bracket-group succeeded, ALIUS skips to end of
              {} block (a skip, not a FAIL); if the prior group FAILed,
              ALIUS does nothing. Thus, ALIUS is both an ELSE clause and a
              CASE statement.

   alter (:var:) /new-val/
            - surgically change value of var to new-val
     (:var:)   - var to change (var-expanded)
     /new-val/ - value to change to (var-expanded)

   call /:entrypoint_label:/
   call /:entrypoint_label:/ [:arg1: :arg2:... ]
   call /:entrypoint_label:/ [:arg1: :arg2:...
        ] (:ret_arg:)
            - do a routine call on the specified (var-expanded) entrypoint
              label. Note that the called routine shares all variables
              (including the data window :_dw:). Return is accomplished
              with the RETURN statement.
     /:entrypoint_label:/ - the location to call
     [:arg1: :arg2: ...] - var-expanded list of args to call. These are
              concatenated and supplied to the called routine as a single
              ISOLATEd var, to be used as desired (usually a MATCH parses
              the arglist as desired, then :*: is used for call-by-value
              arguments, and :+: indirection is used to retrieve
              call-by-name arguments). Call-by-value arguments are NOT
              modifiable by the callee, while call-by-name arguments are
              modifiable.
     (:ret_arg:) - this variable gets the returned value from the routine
              called (if it returns anything). If it had a previous value,
              that value is overwritten on return.

   classify (:c1:...|...:cN:) (:stats:) [:in:] /word-pat/
   classify (:c1:...|...:cN:) (:stats:) [:in:] /word-pat/ /pR_offset/
   classify (:c1:...|...:cN:) (:stats:) [:in:] /word-pat/ /svm-specific controls/
            - compare the statistics of the current data window buffer
              with classfiles c1...cN . In general, class statistics files
              are NOT portable between different classifiers!
     - ignore case in word-pat, does not ignore case in actual text (use
       tr() or the TRANSLATE command to do that on :in: if you want it)
     - enable the microgroomer to purge less-important information
       automatically whenever the statistics file gets too crowded.
       However, this disables certain optimizations that can speed
       classification.
     - use unique features only; this improves accuracy while using less
       memory. Usable with Markov and OSB modes.
     - use single-word features only; this makes CRM114 almost exactly
       equivalent to most other Bayesian classifiers. Works with the OSB,
       Winnow and hyperspace classifiers.
     - use orthogonal sparse bigram (OSB) features and Markovian
       classification instead of Markovian SBPH features.
       OSB uses a subset of SBPH features with about 1/4 the memory and
       disk needs, and about 4x the speed of full Markovian, with
       basically the same accuracy.
     - use the Fidelis Confidence Factor local probability generator. This
       format is not compatible with the default, but with single-sided
       threshold training ( typically pR of 10-30 ) achieves the best
       accuracy yet.
     - use the Winnow non-statistical classifier and the OSB front-end
       feature generator. Winnow uses .cow files, which are not compatible
       with the .css files for the Markovian (default) and OSB classifiers.
     - use hyperspace matching; each learned document represents a light
       source in a 4-billion-dimensional hyperspace, and the set of sources
       that shines most brightly onto the unknown document's hyper-spatial
       location is the matching class. EXPERIMENTAL!!!
     - use the bit-entropy classifier. This uses compressibility of the
       unknown given the prior learned text as a perfect compressor model.
       No tokenization happens - this classifier works one bit at a time,
       always. EXPERIMENTAL !!!
     - use the SVM classifier. This uses SVM (support vector machine)
       techniques. NB: for now VERY EXPERIMENTAL; OSB or unigram features
       (default OSB features), 2-class only, generates A_vs_B files.
     - use the String Kernel SVM. String kernels take one character at a
       time as token features, but don't use omitted subsections like the
       OSB feature set. VERY EXPERIMENTAL. 2-class only, generates A_vs_B
       files.
     - use a three-layer neural network with stochastic back-propagation
       training. Use to reinitialize the network neurons to a small random
       state in case it gets stuck in a (rare) local minimum.
       VERY EXPERIMENTAL!!!
     - use the full correlative matcher. Very slow, but capable of matching
       stemmed words in any language and of matching binary files.
     (:c1: ... - file or files to consider "success" files. The CLASSIFY
              succeeds if these files as a group match best. If not, the
              CLASSIFY does a FAIL.
     |        - optional separator.
              Spaces on each side of the " | " are required.
     .... :cN:) - optional files to the right of " | " are considered as a
              group to "fail". If statement fails, execution skips to end
              of enclosing {..} block, which exits with a FAIL status (see
              ALIUS for why this is useful).
     (:stats:) - optional var that will be surgically changed to contain a
              formatted matching summary. In some versions, must pre-exist.
     [:in:]   - restrict statistical measure to the string inside :in:
     [:in: n m] - take a substring of :in:, starting at n and including m
              characters
     [:in: /regex/] - take a substring of :in: that matches the regex
     /word-pat/ - regex to describe what a parse-able word is. Default is
              /[[:graph:]]+/
     /pR_offset/ - OSBF: change the classify threshold; with this optional
              parameter the success/failure decision point can be changed
              from the default 0 to what you specify. If given, the pR in
              'stats' will be printed in the form pR/pR_offset.
     /svm-specific controls/ - a vector of seven parameters for
              SVM-classifiers

   clump [:text:] (clumpfile) (status) /regex/ /params/
            - does incremental parametric clustering of documents to
              generate document groups. No pre-judged corpus is required.
     [:text:] - input text; var-restriction allowed
     (clumpfile) - name of file to hold the clumps (all docs go into the
              same clumpfile)
     (status) - Status output, for the result of the clump. Clumping the
              null input text will give a status dump of all the documents
              in the entire clumpfile.
     - special control flags; unigram, unique, and refute are supported,
       with the same meanings as in LEARN and CLASSIFY. Default clustering
       is by document-to-document nearest-neighbor hyper-spatial distance.
       If you add the bychunk flag, then the distance is to the cluster's
       centroid.
     /regex/  - optional tokenization regex; default is /[[:graph:]]+/
     /params/ - control parameters:
              "tag=somename" label to later refer to this document.
              "clump=somename" forces a name onto a cluster.
"n_clusters=N" says how many doc clusters you want; if N=0 then it will simply store the document and wait for more (much faster computationally). If N < 0 the number of clusters is determined automatically. cssanalyze (:c1:) (:report:) /params/ - analyze the CRM database :c1: and report into :report:. The analysis may include extensive integrity checks. - special control flags; default and basic are supported. /params/ - control parameters a la cssutil. WARNING: THIS IS A PLANNED COMMAND, which will obsolete the external cssutil tool in due time. cssbackup (:dst: :c1:) (:report:) /params/ - export/backup the CRM database :c1: into file :dst:. The output format is CSV. Messages are written to :report:. - special control flags; default is supported. /params/ - control parameters a la cssutil. WARNING: THIS IS A PLANNED COMMAND, which will obsolete the external cssutil tool in due time. csscreate (:c1:) (:report:) /params/ - create a new CRM database :c1: as specified by /params/. This command can be used to create new/empty CRM databases for any classifier. Messages are written to :report:. - special control flags; default is supported. /params/ - control parameters a la cssutil. WARNING: THIS IS A PLANNED COMMAND, which will obsolete the external cssutil tool in due time. cssdiff (:c1: :c2:) (:report:) /params/ - report differences between the CRM databases :c1: and :c2: into :report:. - special control flags; default and unique are supported. /params/ - control parameters a la cssmerge. WARNING: THIS IS A PLANNED COMMAND, which will obsolete the external cssdiff tool in due time. cssinfo (:c1:) (:report:) /params/ - list information about the CRM database :c1: into :report:. - special control flags; default is supported. /params/ - control parameters a la cssutil. WARNING: THIS IS A PLANNED COMMAND, which will obsolete the external cssutil tool in due time. cssmerge (:destfile: :c1:...:cN:) (:report:) /params/ - merge the CRM databases :c1: ... :cN: into destfile. 
              Messages are written to :report:.
     - special control flags; default, unique, and microgroom are
       supported.
     /params/ - control parameters a la cssmerge.
     WARNING: THIS IS A PLANNED COMMAND, which will obsolete the external
     cssmerge tool in due time.

   cssrestore (:c1: :src:) (:report:) /params/
            - import/restore the CRM database :c1: from file :src:. The
              input format is CSV. Messages are written to :report:.
     - special control flags; default is supported.
     /params/ - control parameters a la cssutil.
     WARNING: THIS IS A PLANNED COMMAND, which will obsolete the external
     cssutil tool in due time.

   debug    - drop immediately into the interactive debugger.

   eval (:result:) /instring/
            - repeatedly evaluates /instring/ until it ceases to change,
              then surgically places that result as the value of :result: .
              EVAL uses smart (but foolable) heuristics to avoid infinite
              loops, like evaluating a string that evaluates to a request
              to evaluate itself again. The error rate is about 1 / 2^62
              and (in the default configuration) will detect looping chain
              groups of length 4096 or less. If the instring uses math
              evaluation (see section below on math operations) and the
              evaluation has an inequality test (>, >=, <, <=, =, or !=),
              then if the test fails, the EVAL will FAIL to the end of
              block. Math is IEEE-compliant, so unreasonable things like
              divide-by-zero may yield NaN (Not A Number) or +/- INF.

   exit /:exitcode:/
            - ends program execution. If supplied, the return value is
              converted to an integer and returned as the exit code of the
              crm114 program.
     /:exitcode:/ - variable to be converted to an integer and returned.
              If no exit code is supplied, the exit code value is 0.

   fail     - skips down to end of the current { } block and causes that
              block to exit with a FAIL status (see ALIUS for why this is
              useful)

   fault /faultstr/
            - forces a FAULT with the given string as the reason.
     /faultstr/ - the var-expanded fault reason string

   goto /:label:/
            - unconditional branch (you can use a variable as the goal,
              e.g.
              /:*:there:/ )

   hash (:result:) /input/
            - compute a fast 32-bit hash of the /input/, and ALTER :result:
              to the hexadecimal hash value. HASH is _not_ warranted to be
              constant across major releases of CRM114, nor is it
              cryptographically secure.
     (:result:) - value that gets result.
     /input/  - string to be hashed (can contain expanded :vars: , defaults
              to the data window :_dw: )

   input [:filename:]
   input (:result:) [:filename:]
   input (:result:) [:filename: offset len]
            - read in the content of filename; if no filename, then read
              stdin
     - read one line only
     - read one line only, using the history-aware readline library
     (:result:) - var that gets the input value (surgical overwrite). When
              no :result: variable has been specified, the default :_dw:
              input window is assumed.
     [:filename:] - the file to read. The first blank-delimited word is
              taken and var-expanded; the result is the filename, even if
              it includes embedded spaces. Default is to read stdin.
     [:filename: offset len] - optionally, move to offset in the file, and
              read len bytes. Offset and len are individually
              blank-delimited, and var-expanded with mathematics enabled.
              If len is unspecified, the read extends to EOF or buffer
              limit.

   intersect (:out:) [:var1: :var2: ...]
            - makes :out: contain the part of the data window that is the
              intersection of :var1: :var2: ... ; ISOLATEd vars are
              ignored. This only resets the value of the captured :out:
              variable, and does NOT alter any text in the data window.

   isolate (:var:)
   isolate (:var:) /initial-value/
   isolate (:var:) /initial-value/
   isolate (:var1: :var2: ... :varN:)
            - puts :var: into a data area outside of the default data
              window buffer; subsequent changes to this var don't change
              the data buffer (though they may change the value of any var
              subsequently set inside of this var). If the var already was
              ISOLATED, it will stay isolated, but its value will be
              surgically altered if a /value/ is given.
     - only create and set var if it didn't exist before (ideal for setting
       defaults)
     (:var:)  - name of ISOLATEd var (var-expanded)
     /initial-value/ - optional initial value for :var: (var-expanded). If
              no value is supplied, the previous value is retained/copied.

   lazy     - WARNING: reserved for future use

   learn (:class:) [:in:] /word-pat/
   learn (:class:) [:in:] /word-pat/ /entropy_fuzz/
   learn (:class:) [:in:] /word-pat/ /svm-specific controls/
            - learn the statistics of the :in: var (or the input window if
              no var) as an example of class :class:
     - flag this as an anti-example of this class - unlearn it!
     - ignore case in word-pat, does not ignore case in actual text (use
       tr() or the TRANSLATE command to do that on :in: if you want it)
     - enable the microgroomer to purge less-important information
       automatically whenever the statistics file gets too crowded.
       However, this disables other optimizations that can speed up
     - use orthogonal sparse bigram (OSB) features and Markovian
       classification instead of Markovian SBPH features. OSB uses a subset
       of SBPH features with about 1/4 the memory and disk needs, and about
       4x the speed of full Markovian, with basically the same accuracy.
     - use the Fidelis Confidence Factor local probability generator. This
       format is not compatible with the default, but with single-sided
       threshold training ( typically pR of 10-30 ) achieves the best
       accuracy yet.
     - use the Winnow non-statistical classifier and the OSB front-end
       feature generator. Winnow uses .cow files, which are not compatible
       with the .css files for the Markovian (default) and OSB classifiers.
       Remember that for Winnow to be at its best in accuracy, it has to be
       trained both with positive cases that failed to make a minimum
       threshold (typically with a per-file (not overall) match quality
       that was below a pR of .2 or more) as well as for "negative
       reinforcement" training for any "not in class" per-file match
       qualities that weren't at a pR of -.2 or less.
     - use hyperspace matching; each learned document represents a light
       source in a 4-billion-dimensional hyperspace, and the set of sources
       that shines most brightly onto the unknown document's hyper-spatial
       location is the matching class. EXPERIMENTAL!!!
     - use single-word features only; using this makes CRM114 almost
       exactly equivalent to most other Bayesian classifiers. Also works
       with the Winnow and hyperspace classifiers.
     - use the bit-entropy classifier. This uses compressibility of the
       unknown given the prior learned text as a perfect compressor model.
       No tokenization happens - this classifier works one bit at a time.
       The tokenizer regex is ignored; the second // argument can hold an
       optional "fuzz factor" for how close an approximation is allowed.
     - use the SVM classifier. This uses SVM (support vector machine)
       techniques. NB: for now VERY EXPERIMENTAL; OSB or unigram features
       (default OSB features), 2-class only, generates A_vs_B files.
     - use the String Kernel SVM. String kernels take one character at a
       time as token features, but don't use omitted subsections like the
       OSB feature set. VERY EXPERIMENTAL. 2-class only, generates A_vs_B
       files.
     - use a three-layer neural network with stochastic back-propagation
       training. VERY EXPERIMENTAL!!!
     - use the full correlative matcher. Very slow, but capable of matching
       stemmed words in any language and of matching binary files.
       Correlative matching does not tokenize, and so you don't need to
       supply it with a word-pat.
     (:class:) - name of file holding hashed results; nominal file
              extension is .css
     [:in:]   - captured var containing the text to be learned (if omitted,
              the full contents of the data window is used)
     [:in: n m] - take a substring of :in:, starting at n and including m
              characters
     [:in: /regex/] - take a substring of :in: that matches the regex
     /word-pat/ - regex that defines a "word". Things that aren't "words"
              are ignored. Default is /[[:graph:]]+/. Ignored in
              correlation and bit-entropy.
     /entropy_fuzz/ - Bit-entropy: this number is the "fuzz" factor in
              determining when to loop back the compression algorithm
              Markov chain versus allocating new nodes. You must specify an
              empty word-pat to use entropy fuzz.
     /svm-specific controls/ - a vector of seven parameters for
              SVM-classifiers

   liaf     - skips UP to START of the current {} block (LIAF is FAIL
              spelled backwards)

   match /regex/
   match [:in:] /regex/
   match [:in: start len] /regex/
   match [:in: /inregex/] /regex/
   match (:var1: ...) [:in:] /regex/
   match (:var1: ...) [:in: start len] /regex/
   match (:var1: ...) [:in: /inregex/] /regex/
            - Attempt to match the given regex; if match succeeds,
              variables are bound; if match fails, program skips to the
              closing '}' of this block
     - statement succeeds if match not present
     - ignore case when matching
     - No special characters in regex (only supported with TREregex, not
       GNUregex.) Think of this as WYSIWYG matching.
     - start match at start of the [:in:] var
     - start match at start of previous successful match on the [:in:] var
     - start match at one character past the start of the previous
       successful match on the [:in:] var
     - start match at one character past the end of prev. match on this
       [:in:] var
     - require match to end after end of prev. match on this [:in:] var
     - search backward in the [:in:] variable from the last successful
       match.
     - execute the search in blocks of one line of text each, so the result
       will never span a line. This means that ^ and $ will match at the
       beginning and end of each line, rather than the beginning and end of
       the full text.
     (:var1: ...) - optional result vars. The first var gets the text
              matched by the full regex. The second, third, etc. vars get
              each subsequent parenthesized sub-expression, in
              left-to-right order of the sub-expression's left parenthesis.
              These are "captures", not ALTERs, so text overlapping prior
              :var: values is left unchanged.
     [:in:]   - search only in the variable specified; if omitted, :_dw:
              (the full input data window) is used
     [:in: start len] - search in the :in: input var, limiting the area
              searched to start to len (zero-origin counted)
     [:in: /inregex/ ] - search in the :in: input var, limiting the
              searched area to whatever matches the inregex (this doesn't
              use or affect previous successful match values). If the
              /inregex/ contains subregexes, the last subregex will be used
              to produce the limited :in: content to be matched against.
     /regex/  - POSIX regex (with \ escapes as needed)
     NB: If you build CRM114 to use the GNU regex library for MATCHing, be
     warned that GNU REGEX has numerous issues. See the KNOWN_BUGS file
     for a detailed listing.

   mutate (:dest:) [:src:] /from args/ /to args/
            - WARNING: reserved for future use

   output [filename] /output-text/
            - output an arbitrary string with captured values expanded.
     - append to the file (otherwise, the previous contents of the file are
       lost).
     [:filename:] - the file to write. The first blank-delimited word is
              taken and var-expanded; the result is the filename, even if
              it includes embedded spaces. Default output is to stdout.
              stderr is recognized.
     [:filename: offset len] - optionally, move to offset in the file, and
              write at most len bytes. Offset and len are individually
              blank-delimited, and var-expanded with mathematics enabled.
              If len is unspecified, the write is the length of the
              expansion of /output-text/
     /output-text/ - string to output (var-expanded)

   pmulc (clumpfile) [:text:] /regex/
            - use the clumpfile as a look-up to translate documents to
              their appropriate clusters. The text does not get added into
              the clumpfile.
     [:text:] - input text; var-restriction allowed.
     (clumpfile) - name of file holding the clumps
     /regex/  - optional tokenization regex; default is /[[:graph:]]+/
     - The optional flags are bychunk, unique, and unigram, with the same
       functions as under clump.

   return /returnval/
            - return from a CALL.
              Note that since CALL executes in shared space with the
              caller, all changes made in the CALLed routine are shared
              with the caller.
     /returnval/ - this (var-expanded) value is returned to the caller (or
              if the caller doesn't accept return values, it's discarded).

   routine  - WARNING: reserved for future use

   sort     - WARNING: reserved for future use

   syscall (:in:) (:out:) (:status:) /command_or_label/ [timeout pollcycle]
            - execute a shell command or fork to the specified label. This
              happens in a fresh copy of the environment; there is no
              communication with the main program except via the :in:,
              :out:, and :status: vars. Output over the buffer length is
              discarded unless you keep the process around for multiple
              readings.
     - don't send an EOF after feeding the full input (this will usually
       keep the syscalled process around). Later syscalls with the same
       :status: var will continue feeding to and reading from the kept
       process.
     - don't wait for process to output an EOF; just grab what's available
       in the process's output pipe and proceed (default limit per syscall
       is 256 Kb). The process then runs to completion independently and
       asynchronously. (This is "fire and forget" mode, and is mutually
       exclusive with keeping the process around.)
     [timeout] - only allow the called process to run for N seconds.
              (0 = unlimited)
     [pollcycle] - set the I/O poll cycle to N seconds. Use to reduce
              CPU/OS load (higher N) or increase response speed for long
              running processes which produce very little output.
     (:in:)   - var-expanded string to feed to command as input (can be
              null if you don't want to send the process something). You
              _MUST_ specify this if you want to specify an :out: variable.
     (:out:)  - var-expanded varname to place results into. MUST pre-exist;
              can be null if you don't want to read the process's output
              (yet, or at all). Limit per syscall is 256 Kbytes. You _MUST_
              specify this if you want to use the :status: variable. This
              is a surgical alter.
      (:status:) - if you want to keep a minion proc around, or catch
           the exit status of the process, specify a varname here.
           The minion process's PID and pipes will be stored here.
           The program can access the proc again with another syscall
           by using this var again. When the process exits, its exit
           code will be surgically stored here (unless you chose
           "fire and forget" mode).
      /command_or_label/ - the command or entrypoint you want to run.
           This arg is var-expanded; if the first word is a :label:,
           the fork begins execution at the label. If the first word
           is not a :label:, then the entire string is handed off to
           the shell to be executed as a shell command. This argument
           is optional: when you have kept the process around and are
           performing a second or subsequent syscall, you can
           dispense with the // arg.

 translate (:dest:) [:src:] /from_charset/ /to_charset/
      - do a tr()-like translation of 8-bit characters in the
           from_charset to the corresponding characters in the
           to_charset.
      - repeated sequential copies of the same char in from_charset
           are replaced by a single copy, then translated.
      - from_charset and to_charset are literal; no var-expansion,
           ranging, or inversion is performed.
      [:src:] - source of data. Can be var-restricted. Default is the
           default data window :_dw:
      (:dest:) - destination to put result. Defaults to the default
           data window :_dw:
      /from_charset/ - var-expanded charset of characters to be
           translated from. Use hyphens for ranges like a-e meaning
           'abcde'. Reversed ranges such as e-a, meaning 'edcba',
           work (this is different than tr() !). Set inversion, as in
           ^a-z meaning all characters that aren't lower-case
           characters, works. Character duplication is not an error.
           To use - as a literal character, make it the first or last
           character. To use ^ as a literal character, make it any
           but the first character. ASCII \-escapes like \n and \xFF
           work.
      /to_charset/ - charset of characters to be translated to.
           Same rules as from_charset; excess characters are ignored;
           if not enough characters are available, start over using
           the to_charset characters from the beginning (this is
           different than tr().) If to_charset is not given, then all
           chars in from_charset are deleted.

 trap (:reason:) /trap_regex/
      - traps faults from both FAULT statements and program errors
           occurring anywhere in the preceding bracket-block or single
           executable statement. If no fault exists, TRAP does a SKIP
           to end of block. If there is a fault and the fault reason
           string matches the trap_regex, the fault is trapped, and
           execution continues with the line after the TRAP;
           otherwise the fault is passed up to the next surrounding
           trapped bracket block.
      (:reason:) - the fault message that caused this FAULT. If it was
           a user fault, this is the text the user supplied in the
           FAULT statement. This variable is allocated as an ISOLATED
           variable.
      /trap_regex/ - the regex that determines what kind of faults
           this TRAP will accept. Putting a wildcard here (e.g. /.*/)
           means that ALL trappable faults will be trapped.

 union (:out:) [:var1: :var2: ...]
      - makes :out: contain the union of the data window segments that
           contain var1, var2... plus any intervening text as well.
           Any ISOLATEd var is ignored. This is non-surgical, and
           does not alter the data window.

 window (:w-var:) (:s-var:) /cut-regex/ /add-regex/
      - window slider. This deletes up to and including the cut-regex
           from :w-var: (default: use the data window), then reads
           and adds from standard input until add-regex is found
           (inclusive).
      - ignore case when matching cut- and add- regexes.
      - (default) read one char at a time and check input for
           add-regex every character, so never reads "too much" from
           stdin.
      - reads as much data as available, then checks with the regex.
           (unused characters are kept around for later)
      - wait for EOF to check add-regex.
           (unused characters are kept around for later)
      - accept an EOF as being a successful regex match (default is
           that only a successful add-regex match counts. CAUTION:
           can cause rapid looping!)
      - keep reading past an EOF; reset the stream and wait again for
           more input. (default is to FAIL on EOF. CAUTION: this can
           cause rapid looping!)
      (:w-var:) - what var to window
      (:s-var:) - what var to use for source (defaults to stdin; if
           you use a source var you _must_ specify the windowed var.)
      /cut-regex/ - var-expanded cut pattern. Everything up to and
           including this is deleted.
      /add-regex/ - var-expanded add pattern; if absent, reads till
           EOF. This pattern is a minimal match pattern, so if the
           pattern can match a zero-length string (say, /.*/ ), this
           can yield zero characters added. Use a pattern like /.+/
           to prevent this.

      ***** If both cut-regex and add-regex are omitted, then this
           window statement is an executable no-op... EXCEPT that if
           it's the _first_ _executable_ statement in the program,
           then the WINDOW statement configures CRM114 to _not_ wait
           to read anything from standard input before starting
           program execution.

------------ A Quick Regex Intro ---------

 A regex is a pattern match. Do a "man 7 regex" for details.

 Matches are, by default, "first starting point that matches, then
 longest match possible that can fit".

    a through z, A through Z, 0 through 9
                     - all match themselves
    most punctuation - matches itself, but check below!
    .    the 'period' char, matches any character
    *    repeat preceding 0 or more times
    +    repeat preceding 1 or more times
    ?
         repeat preceding 0 or 1 time
    [abcde]      any one of the letters a, b, c, d, or e
    [a-q]        the letters a through q (just one of them)
    [a-eh-mqzt]  the letters a through e, plus h through m, plus q,
                 z, and t
    [^xyz]       any one letter EXCEPT one of x, y, or z
    [^a-e]       any one letter EXCEPT one of a through e
    {n}          repetition count: match the preceding exactly n times
    {n,}         repetition count: match the preceding at least n times
    {n,m}        repetition count: match the preceding at least n and
                 no more than m times (sadly, POSIX restricts this to
                 a maximum of 255 repeats. Nested repeats like
                 (.{255}){10} will work, but are very very slow).
    [[:<:]]      matches at the start of a word (GNU regex only)
    \<           matches at the start of a word (TRE regex only)
    [[:>:]]      matches the end of a word (GNU regex only)
    \>           matches at the end of a word (TRE regex only)
    ^            As the first character in a match, it matches only at
                 the start of a block; this usually means the start of
                 the input variable. If you use nomultiline matching,
                 then each line is its own block and so ^ means
                 "start of line".
    $            As the last character in a match, it matches only at
                 the end of a block; this usually means the end of the
                 input variable. If you use nomultiline matching, then
                 each line is its own block and so $ means
                 "end of line".
    .            (a period) matches any _single_ character (except
                 start-of-line or end-of-line "virtual characters",
                 but it does match a newline).
    (match)      the () go away, and the string that matched inside is
                 available for capturing. Use \( and \) to match
                 actual parentheses.
    a|b          match a _or_ b, such as foo|bar which will match
                 "foo" or "bar" (multiple characters!). To get a
                 shorter extent of ORing, use parentheses, e.g.
                 /f(oo|ba)r/ matches "foor" or "fbar", but not foo or
                 bar.

 The following are other POSIX expressions, which mostly do what you'd
 guess they'd do from their names.

    [[:alnum:]]  <-- a-z, A-Z and 0-9
    [[:alpha:]]  <-- a-z and A-Z
    [[:blank:]]  <-- space and tab only
    [[:space:]]  <-- "whitespace" (space, tab, vertical tab (^K), \n, \r, ..)
[[:cntrl:]] <-- control characters [[:digit:]] <-- 0-9 [[:lower:]] <-- lower-case letters a-z [[:upper:]] <-- upper-case letters A-Z [[:graph:]] <-- any character that puts ink on paper or lights a pixel [[:print:]] <-- any character that moves the "print head" or cursor. [[:punct:]] <-- punctuation characters [[:xdigit:]] <-- hex digits 0-9, a-f and A-F ----- The following are only available with the TRE-based versions ----- *?, +?, ??, {n,m}? - repeat the preceding expression 0-or-more, 1-or-more, 0-or-1, or n-to-m times, but _shortest_ match that fits, given the already-selected start point of the regex. This is an "anti-greedy" match, unlike the normal match that wants to have the longest possible resulting match \N - where N is 1 through 9 - matches the N'th parenthesized previous sub-expression. You don't have to backslash-escape the backslash (e.g. write this as \1 or as \\1, either will work) \Q - start verbatim quoting - all following characters represent exactly themselves; no repeat counts or wildcards apply. This is _only_ terminated by a \E or the end of the regex. \E - end of verbatim quoting. \< - start of a word (doesn't use up a character) \> - end of a word (doesn't use up a character) \d - a digit \D - not a digit \s - a space \S - not a space \w - a word char ( a-z, A-Z, 0-9, or _ ) \W - not a word char (?:some-regex) - parenthesize a sub-expression, but _don't_ capture a sub-match for it. (?inr-inr:regex) - Let you turn on or off case independence, nomultiline, and right-associative (rather than the default left-associative) matching. These nest as well. i - case independent matching. examples: /(?i:abc)/ matches 'abc', 'AbC', 'ABC', etc... /(?i:ABC(?-i:de)FGH)/ matches ABCdeFGH, abcdefgh, but not ABCdEFGH or ABCDEFGH n - don't match newlines with wildcards such as .* or with anti-wildcards like [^j-z]. "-n" _allows_ matching of newlines (this is slightly counter-intuitive). 
                 e.g.: /(?n:a.*z)/ matches 'abcxyz' but not 'abc\nxyz'
                       /(?-n:a.*z)/ matches both
                 (this does NOT override the nomultiline flag, which
                 essentially "blocks" the searched text at newlines,
                 and searches within those blocks only)
            r - right-associate matching. This changes only
                 sub-matches, never whether the match itself succeeds
                 or fails. (I haven't come up with a good example for
                 this; any suggestions?)

-------------- Notes on Sequence of Evaluation -------------

 By default, CRM114 supports string length and mathematical evaluation
 only in an EVAL statement, although it can be set to allow these in
 any place where a var-expanded variable is allowed (see the -q flag).
 The default value (zero) allows string length and math evaluation
 only in EVAL statements, and uses non-precedence (that is, strict
 left-to-right unless parentheses are used) algebraic notation. -q 1
 uses RPN instead of algebraic, again allowing string length and math
 evaluation only in EVAL expressions. Modes 2 and 3 allow string
 length and math evaluation in _any_ var-expanded expression, with
 non-precedence algebraic notation and RPN notation respectively.

 You can override whether to use Algebraic or RPN precedence of any
 math evaluation by using an A or an R as the first character of the
 math evaluation string.

 Evaluation is always left-to-right; there is no precedence of
 operators beyond the sequential passes noted below.

 The evaluation is done in five sequential passes:

  1) \-constants like \n, \o377 and \x3F are substituted. You must use
     three digits for octal and two digits for hex. To write something
     that will literally appear as one of these constants, escape the
     backslash with another backslash, i.e. to output '\o075' use
     '\\o075'.

  2) :*:var: variables are substituted (note the difference between a
     constant like '\n' and a variable like ":*:_nl:" here - constants
     are substituted first, then variables are substituted.).
     If there is no such variable, then the 'variable name' is its own
     result, so :*:I_am_not_defined: yields ":I_am_not_defined:".

  3) :+:var: indirection variables are substituted. This is equivalent
     to taking :*: twice immediately (note that :*::*:foo:: does not
     do this as :*: cannot be nested - unless executed within an
     'eval' command!). Note that if a regular variable is indirected,
     the result is unchanged (just as if a non-variable is :*:
     substituted; the result is the input).

  4) :#:var: string-length operations are performed. (You don't have
     to expand a :var: first; you can take the string length directly,
     as in :#:_dw: to get the length of the default data window. Thus,
     you can take the length of a string that contains a :, which
     would normally "end" the :#: operator.)

  5) :@:expression: mathematical expressions are performed; syntax is
     either RPN or non-precedenced (parens required) algebraic
     notation. Embedding non-evaluated strings in a mathematical
     expression is currently a no-no. If the first character of the
     math string is an A or an R, it forces Algebraic or RPN
     evaluation; otherwise the -q value determines which evaluator to
     use.

     Allowed operators are:
        + - * / % ^ v > < = >= <= != e E f F g G x X only.

     The '^' operator is exponentiation; A ^ B is A raised to the B
     power. The 'v' operator is any-base log; A v B is the log of B in
     logbase A; note that the logbase is _required_ and there is no
     default.

     Only >, >=, <, <=, = and != set logical results; they also
     evaluate to 1 and 0 for continued chain operations - e.g.

        ((:*:a: > 3) + (:*:b: > 5) + (:*:c: > 9) >= 2)

     is true IFF any of the following is true:
        a > 3 and b > 5
        a > 3 and c > 9
        b > 5 and c > 9

     Formatting operators: e E f F g G x X - the left side value is
     unchanged, but the right side value is used as a formatting
     precision value (note that x and X do not change precision),
     (i.e.
     the speed of light expressed in E 7.2 precision such as by
     299792458 E 7.2 is 3.00E+08)

     The operators e, E, f, F, g, G, x, and X have the same meaning as
     in C. (beware a precision after the decimal of 10 though; and
     note that an x or X format is limited to 32 bits.)

-------------- Notes on Approximate REGEX matching ---------

 The TRE regex engine (which is the default engine) supports
 approximate matching. The GNU engine does not support approximate
 matching.

 Approximate matching is specified similarly to a "repetition count"
 in a regular regex, using brackets. This approximation applies to the
 previous parenthesized expression (again, just like repetition
 counts). You can specify maximum total changes, and how many inserts,
 deletes, and substitutions you wish to allow. The minimum-error match
 is found and reported, if it exists within the bounds you state.

 The basic syntax is:

    (text-to-match){~[maxerrs] [#maxsubsts] [+maxinserts] [-maxdeletes]}

 Note that the '~' (with an optional maxerr count) is _required_
 (that's how we know it's an approximate regex rather than just a
 rep-count); if you don't specify a max error count, you will get the
 best match; if you do, the match will have at most that many errors.

 Remember that you specify the changes to the text in the _pattern_
 necessary to make it match the text in the string being searched.

 You cannot use approximate regexes and backrefs (like \1) in the same
 regex. This is a limitation of TRE at this point.

 You can also use an inequality in addition to the basic syntax above:

    (text-to-match){~[maxerrs] [basic-syntax] [nI + mD + oS < K] }

 where n, m, and o are the costs per insertion, deletion, and
 substitution respectively, 'I', 'D', and 'S' are indicators to tell
 which cost goes with which kind of error, and K is the total cost of
 the errors; the cost of the errors is always strictly less than K.

 Here are some examples.
    (foobar) - exactly matches "foobar"

    (foobar){~} - finds the closest match to "foobar", with the
         minimum number of inserts, deletes, and substitutions. This
         match always succeeds, as six substitutions or additions are
         always enough to turn any string into one that contains
         'foobar'.

    (foobar){~3} - finds the closest match to "foobar", with no more
         than 3 inserts, deletes, or substitutions

    (foobar){~2 +2 -1 #1} - find the closest match to "foobar", with
         at most two errors total, and at most two inserts, one
         delete, and one substitution.

    (foobar){~4 #1 1i + 2d < 5 } - find the closest match to "foobar",
         with at most four errors total, at most one substitution, and
         with the number of insertions plus 2x the number of deletions
         less than 5.

    (foo){~1}(bar){~1} - find the closest match to "foobar", with at
         most one error in the "foo" and one error in the "bar".

------------ Notes on Classifier Choices -------

 CRM114 allows the user a whole gamut of different classification
 algorithms, and various tunings on classifications.

 The default classifier is a Markovian classifier that attempts to
 model the language as a Markov Random Field with site size of 5 (in
 plainspeak, it looks at each word in the context of a window 5 words
 long; words within that window are considered "directly related" and
 are used to generate local probabilities. Words outside that 5-word
 window are not considered in relation to each word, but get
 considered when the window slides over to them).

 The Markovian classifier is quite fast; more than fast enough for a
 single user or even a small office. Filtering speed varies: with no
 optimization and with overflow safeguarding (that is, with
 microgrooming enabled), filtering speed is usually in excess of what
 a fractional T1 line can downlink.

 The Markovian filter can be sped up considerably by turning off
 overflow safeguarding (that is, by not using microgrooming); this
 optimization speeds up learning significantly, but it means that
 learning is unsafe.
 System operators must instead manually monitor the fullness of the
 .css files and either manually groom them or expand them as required
 (or a script must be used to automate this maintenance, which can be
 done "in flight").

 [ This classifier is the original CRM114 classifier and should be
 considered deprecated for new work, although it is still supported.
 The recommended classifier right now for production work is OSB or
 OSBF. ]

 The next generation filter (and one of the two recommended for new
 production work) is the OSB filter, based on orthogonal sparse
 bigrams. OSB is natively about 4x faster than full Markovian, but
 loses some of this advantage if overflow safeguarding is used. OSB is
 almost as accurate as Markovian if disk space is unlimited, and more
 accurate than Markovian if disk space is limited. OSB is the
 recommended default for new users because it works very well across a
 broad range of inputs.

 OSB uses .css files as well, but (because of a coding error that was
 released into the wild and unnoticed until most people were already
 using it in the incompatible form) OSB is, by default, incompatible
 with Markov .css files; there is a compile-time switch to make it
 compatible if you want.

 Another related classifier is the OSBF (OSB with Fidelis mods such as
 the ECCF dynamic weighting) filter. The good news is that OSBF can
 sometimes be even more accurate than OSB or Winnow: it uses an
 exponential weighting to determine local probabilities, giving a
 filter that works very, Very, VERY well. It's incompatible with any
 of the other filters (it uses .cfc files). It's also a good choice
 for new production work.

 Another filter with excellent statistics is the Winnow filter. Winnow
 is a non-statistical method that uses the OSB front end feature
 generator. Winnow is different than the statistical filters, in that
 it absolutely requires both positive training and negative training
 to work, but then it works _very_ well.
With Winnow, you don't just train errors into the correct class (i.e. in emulation of an SVM). Instead, you set a "thick threshold" (usually about +/- 0.2 in the pR scale), and any positive class that doesn't get a per-correct-file score of at least 0.2 pR gets trained as a positive example. Symmetrically, any negative class and negative example that doesn't get below -0.2 of pR needs to be trained as a negative example (that is, using the flags .) This means that with Winnow, on an error you train one or both files. Even if the classifier gives the correct overall result, if the per-file pR values are inside the -0.2 <= per_file_pR <= 0.2 thick-threshold, you may have to train one or both files as well. (these per-file pR values are in the statistics output variable). The slowest classifier is the correlative filter. This filter is based on a full NxM correlation of the unknown text against each of the known text corpora. It's very slow (perhaps 100x slower than Markovian) but is capable of classifying texts containing stemmed words, of texts composed of binary files, and texts that cannot be reasonably "tokenized". The filter should be considered perpetually an experimental feature, and it is not as well characterized as the Markovian or OSB filters. The correlative filter is not recommended for general production work. A semi-experimental filter is the Hyperspace filter; this uses a variation on the K-Nearest-Neighbor method. It's usually not quite as accurate as OSB, but it can filter against very high levels of intentional obfuscation. Hyperspace uses a different (and self-growing) file format. Hyperspace usually trains best with a small thick-threshold training, similar to Winnow; as of 20061101 the factors have been re-normalized so that Hyperspace values within +/- 10 pR units give a good thick-threshold for training. 
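 The thick-threshold training rule described above for Winnow and
 Hyperspace can be sketched in a few lines. This is an illustrative
 Python model only, not CRM114 code: the function names, the file
 names, and the "positive"/"negative" action labels are hypothetical,
 and the behavior exactly at the threshold boundary is an assumption.
 The per-file pR scores are taken as already extracted from the
 classifier's statistics output.

```python
def needs_training(pr, is_correct_class, threshold):
    """Decide whether one per-file pR score calls for training.

    pr               -- per-file pR score reported by the classifier
    is_correct_class -- True if this file holds the text's true class
    threshold        -- thick-threshold half-width (e.g. 0.2 for Winnow)
    """
    if is_correct_class:
        # The correct class must reach at least +threshold.
        return pr < threshold
    # Every other class must fall below -threshold (assumption: a score
    # exactly at -threshold still counts as "inside" the thick band).
    return pr >= -threshold


def training_plan(scores, correct_file, threshold):
    """Return a list of (file, action) pairs that should be trained.

    scores       -- dict mapping statistics file name to per-file pR
    correct_file -- name of the file for the text's true class
    """
    plan = []
    for name, pr in scores.items():
        is_correct = (name == correct_file)
        if needs_training(pr, is_correct, threshold):
            plan.append((name, "positive" if is_correct else "negative"))
    return plan


if __name__ == "__main__":
    # Correct overall result, but the wrong class is not rejected
    # strongly enough, so it still gets a negative training pass.
    print(training_plan({"good.cow": 0.9, "spam.cow": -0.1},
                        "good.cow", 0.2))
```

 Note that the plan can be non-empty even when the overall
 classification was right, and can name one or both files, which is
 the point of the thick threshold. For Hyperspace the same rule would
 apply with the larger threshold quoted above.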
The bit-entropy filter is a different *kind* of filter; instead of using tokens, it constructs an optimal compression system out of the known texts, then it tries to compress the unknown text as much as possible, using the known texts as prior probabilities. Better compression implies closer match. The amazing thing about this is that it works at all- and it actually works very well. Because there's no tokenizer, the entropy filter can work against languages that don't use spaces to delimit words, such as some Asian languages. It works quite well against spam. This filter is still experimental and non-compatible upgrades may occur - keep your training data if you use this filter! ------------ Using the CRM114 built-in script debugger --------- When you run crm114 with the '-d' commandline option or place a 'debug' statement in your scripts, you can use the crm114 built-in debugger to debug/analyze your scripts. These debugger commands are available (you can see the complete list when typing 'h' or '?' at the debugger prompt 'crm-dbg[...]>'): a :var: /value/ - alter :var: to /value/ b - toggle breakpoint on line b