24-Apr-87 13:34:25-EST,16794;000000000001
Return-Path: <sluggo@cunixc.columbia.edu>
Received: from cunixc.columbia.edu by CU20B.COLUMBIA.EDU with TCP; Fri 24 Apr 87 13:34:08-EST
Received: by cunixc.columbia.edu (5.54/5.10) id AA01984; Fri, 24 Apr 87 13:32:35 EST
Date: Fri, 24 Apr 87 13:32:35 EST
From: Thomas De Bellis <sluggo@cunixc.columbia.edu>
To: fdc@cunixc.columbia.edu
Subject: initial ccmd tutorial

cunixc:/u1/sy/sluggo/opser/cmdoc.txt, 27-Mar-1987 15:09:34 by sluggo

The following documents some aspects of the command package.  I make
no attempt to do this in any specified order.  The list is made by
looking over the ~sluggo/opser/cop.c program and explaining, in order
of usage, every ccmd function or structure used.  The cop program is
not the canonical nor the only example of the correct way to parse
things.  There are other programs, such as ccmdtest, which provide more
examples.  These bear examination.

keywords:

A keyword table is an unordered list of possible keywords.  For the
purposes of this discussion, a keyword can be considered to be a
lexical item with more than one possible morpheme, depending on parse
context.  Each entry contains the actual lexeme, some parse control
information and (optional) user information.  The KEYWRD structure
describes each keyword and the KEYTAB structure describes the use of
the table.

/*
 * KEYWRD structure specifies one entry in a keyword table.  KEYTAB
 * structure describes a table of keywords.
 */

typedef struct KEYWRD {
	char *	_kwkwd;		/* keyword string */
	short	_kwflg;		/* flags (see below) */
	int	_kwval;		/* arbitrary value, not used internally */
				/*  except for abbreviations... see KEY_ABR */
				/*  flag below */
} keywrd;

/* Flags that can be present in a keyword entry (in the _kwflg field) */

#define KEY_ABR 0x0001		/* keyword is an abbreviation for the */
				/* keyword indexed by this entry's _kwval */
				/* value */
#define KEY_NOR 0x0002		/* Ignore this keyword (do not recognize */
				/*  any prefix, or even an exact match) */
#define KEY_INV 0x0004		/* Invisible keyword (not shown with help) */
#define KEY_MAT 0x0008		/* This keyword matches current input (used */
				/*  internally) */

Here is an example of a keyword table from the cop program.  Note that
the keyword table is an array of KEYWRD structures and, unlike the
TBLUK% tables used by other parsers such as the DEC20 COMND% parser
and the Pascal interface, it does not have to be in alphabetical
order.

/* The following is the main keyword table.  Each entry consists of
 * the keyword name, some flags for the parser and an internal token
 * which is used for efficient programmed dispatch.  Note that this table
 * is in alphabetical order for historical reasons.  The numerical
 * order of the tokens, however, is irrelevant since it references
 * itself.  See cop.h for further trivia.
 */

keywrd cmds[] = {		/* Begin toplevel keyword table */

  { "b", KEY_ABR|KEY_INV, 1},	/* b to backup (1 is index in this table) */
  { "backup", 0, BACKUCMD },	/* Start some kind of backup script */
  { "bye", KEY_INV,EXITCMD },	/* Synonym for exit */
  { "exit", 0, EXITCMD },	/* Get out of Cop */
  { "fsck", 0, FSCKCMD },	/* Check the file system */
  { "halt", 0, HALTCMD },	/* Halt the CPU */
  { "help", 0, HELPCMD },	/* Help the user */
  { "quit", KEY_INV,EXITCMD },	/* Synonym for exit */
  { "restart", 0, RESTACMD },	/* Turn time sharing on if off */
  { "shutdown", 0, SHUTDCMD },	/* Shut down the computer */
  { "take", 0, TAKECMD },	/* Take commands from a file */
  { "users", 0, USERSCMD },	/* Show users on system */
  { "wall", 0, WALLCMD },	/* Write a message to all users */
  { "!sh", KEY_NOR|KEY_INV, SHELLCMD }	/* Spawn a shell */

	         };		/* End toplevel keyword table */

Note that the _kwval field for each keyword is filled in with a user
specified token.  It is the programmer's responsibility to interpret
this number in all cases except when the KEY_ABR flag is used in the
_kwflg field.  In this case, the number in the _kwval field is used as
an index into the entire array of structures.  In other words, "b" is
an abbreviation for "backup".  The 1 in the _kwval field will never
be seen by the programmer in the course of an ordinary parse--it
refers to the index of the "backup" keyword in the table (i.e.,
position 1, counting from zero).
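
To make the flag semantics concrete, here is a minimal sketch of a
keyword lookup.  It is not the ccmd matching code; lookup_keyword(),
the demo table and the token values are all made up for illustration,
and the rule that an exact match beats a longer keyword sharing the
same prefix is an assumption.  Only the KEYWRD layout and the KEY_*
flags are taken from the declarations above.

```c
#include <stddef.h>
#include <string.h>

#define KEY_ABR 0x0001          /* _kwval is an index into the table */
#define KEY_NOR 0x0002          /* never recognize this entry */
#define KEY_INV 0x0004          /* do not show in help */

typedef struct KEYWRD {
    char *_kwkwd;               /* keyword string */
    short _kwflg;               /* flags */
    int   _kwval;               /* user token, or index if KEY_ABR */
} keywrd;

/* Hypothetical token values; cop defines its own in cop.h. */
enum { BACKUCMD = 100, EXITCMD, HALTCMD, HELPCMD };

keywrd demo[] = {
    { "b",      KEY_ABR | KEY_INV, 1 },         /* abbreviation for entry 1 */
    { "backup", 0,                 BACKUCMD },
    { "bye",    KEY_INV,           EXITCMD },
    { "exit",   0,                 EXITCMD },
    { "halt",   0,                 HALTCMD },
    { "help",   0,                 HELPCMD },
};

/*
 * Return the index of the entry selected by input, -1 if nothing
 * matches, or -2 if the input is an ambiguous prefix.
 */
int lookup_keyword(const keywrd *tab, int n, const char *input)
{
    size_t len = strlen(input);
    int found = -1;
    int matches = 0;
    int i;

    for (i = 0; i < n; i++) {
        if (tab[i]._kwflg & KEY_NOR)
            continue;                           /* ignored entry */
        if (strncmp(tab[i]._kwkwd, input, len) != 0)
            continue;                           /* not a prefix */
        if (tab[i]._kwkwd[len] == '\0') {       /* exact match wins */
            found = i;
            matches = 1;
            break;
        }
        found = i;
        matches++;
    }
    if (matches == 0)
        return -1;
    if (matches > 1)
        return -2;
    if (tab[found]._kwflg & KEY_ABR)            /* follow the abbreviation */
        found = tab[found]._kwval;
    return found;
}
```

With this table, "b" selects entry 1 ("backup") even though "bye"
also begins with a b, which is the whole point of the KEY_ABR entry.
The caller sees BACKUCMD, never the 1.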

In order to parse keywords, the ccmd package must have some additional
control information.  This is supplied by the KEYTAB structure which
is made up of two parts.  The _ktcnt field supplies the parser with a
count of the keywords (that is, KEYWRD structures) in the table and
the _ktwds field supplies an actual pointer to the keyword table in
question.


typedef struct KEYTAB {
	int	_ktcnt;		/* number of keywords in table */
	keywrd * _ktwds;	/* array of keyword entries */
} keytab;


Here is an example from the cop program.

/* The following is the command table structure which is used to keep
 * track of our main keyword table.
 */

keytab cmdtab = { (sizeof(cmds)/sizeof(keywrd)), cmds };

Note the previous nearly obvious trick used to count the number of
keywords in the keyword table: We get the size of the entire table and
divide by the size of an individual entry.
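
The same trick can be wrapped in a macro so that it is written only
once.  NKEYS and the colors table below are hypothetical names, not
part of ccmd.  Be aware that the trick only works where the array
definition itself is visible; applied to a keywrd pointer, sizeof
would give the size of the pointer instead.

```c
typedef struct KEYWRD {
    char *_kwkwd;
    short _kwflg;
    int   _kwval;
} keywrd;

typedef struct KEYTAB {
    int     _ktcnt;
    keywrd *_ktwds;
} keytab;

/* Hypothetical helper: number of elements in a true array. */
#define NKEYS(tab) ((int)(sizeof(tab) / sizeof((tab)[0])))

keywrd colors[] = {
    { "red",   0, 1 },
    { "green", 0, 2 },
    { "blue",  0, 3 },
};

/* The count follows the table automatically when entries are added. */
keytab colortab = { NKEYS(colors), colors };
```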

Other structures are used in the course of a parse.  Various buffers
are used to accumulate data; these are the command buffer, the atom
buffer and the working buffer.  The command buffer contains a complete
typescript of the final user input (without editing characters).  The
atom buffer contains a copy of the last recognized input token.  An
example of this might be the last parsed user keyword.  The working
buffer is used internally by the parser to shuttle characters around.
Here are some sample buffer declarations.

int cmdbuf[BUFSIZ];		/* The command buffer */
char atmbuf[BUFSIZ];		/* The atom buffer */
char wrkbuf[BUFSIZ];		/* The working buffer */

Note that the command buffer is *not* declared as a character array!
This is because other portions of each int slot are used to store
flags about each character.  An example might be if the character is
to be echoed or not.  The command buffer might have been more clearly
defined as a struct made up of a char and a short for flags, but
that's life, I guess.  Normally, a user program only references the
atom buffer to inspect and possibly copy input tokens.
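
As an illustration of why an int per character is handy, here is a
sketch that keeps the character in the low byte of each slot and a
flag in a higher bit.  The mask and flag values, and all of the cb_
names, are assumptions made for this example; the real layout used by
ccmd is not described here and may differ.

```c
#define CB_CHARMASK 0x00ff      /* assumed: low byte holds the character */
#define CB_NOECHO   0x0100      /* assumed flag: character is not echoed */

/* Combine a character and its flags into one buffer slot. */
int cb_pack(char c, int flags)
{
    return ((unsigned char)c & CB_CHARMASK) | flags;
}

/* Recover the character from a slot. */
char cb_char(int slot)
{
    return (char)(slot & CB_CHARMASK);
}

/* True if the slot is marked as not echoed. */
int cb_noecho(int slot)
{
    return (slot & CB_NOECHO) != 0;
}

/* Copy n characters out of an int buffer; dst must hold n + 1 chars. */
void cb_text(const int *buf, int n, char *dst)
{
    int i;

    for (i = 0; i < n; i++)
        dst[i] = cb_char(buf[i]);
    dst[n] = '\0';
}
```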

Another structure that is used is the pval structure.  It is the union
of all the possible types of tokens that a parse can return.  The user
program can use information in the pval structure to recognize which
keyword was used out of a list of keywords, for example.  Here is the
current union declaration:

/* Union declaration for parse return values */

typedef union PVAL {
	int _pvint;
	float _pvflt;
	char _pvchr;
	char *_pvstr;
	char **_pvstrvec;
        datime _pvtad;
        pvfil _pvfil;
        struct passwd ** _pvusr;
        struct group ** _pvgrp;
        char * _pvpara;
} pval;

Here is a sample declaration from cop:

pval parseval;

Note how this might be used after a parse for efficient programmed
dispatch.  Having parsed a keyword, we know that its _kwval value will
now be found in the _pvint member of the parseval union.  Thus, we can
use a switch statement and have a case for every keyword.  That is, we
can have the user data in the KEYWRD structure index us directly to
the associated semantic action routine for that keyword.

    switch(parseval._pvint) {
    case USERSCMD:	/* Show users on system */
      users();
      break;
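
The fragment above is only the top of the switch in cop.  A complete,
stand-alone version of the same idea might look like the sketch below.
The token values, the trimmed pval union, the stub action routines and
the convention of returning 1 on exit are all assumptions made for the
example.

```c
/* Trimmed copy of the union; only the members used here. */
typedef union PVAL {
    int   _pvint;
    char *_pvstr;
} pval;

/* Hypothetical token values; cop takes its own from cop.h. */
enum { EXITCMD = 1, HELPCMD, USERSCMD };

int users_calls;                /* counters so the stubs do something */
int help_calls;

static void users(void) { users_calls++; }
static void help(void)  { help_calls++; }

/* Run the action for one parsed keyword; return 1 when it is time to exit. */
int dispatch(const pval *pv)
{
    switch (pv->_pvint) {
    case USERSCMD:              /* Show users on system */
        users();
        break;
    case HELPCMD:               /* Help the user */
        help();
        break;
    case EXITCMD:               /* Also reached via "bye" and "quit" */
        return 1;
    default:                    /* Token with no action routine */
        break;
    }
    return 0;
}
```

Because "bye", "quit" and "exit" all carry EXITCMD in their _kwval
fields, a single case covers all three spellings.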

The FDB structure can be used either to describe specific fields of a
parse (whence the name FDB, which means function descriptor block) or
to return information to the user program as to what was actually
parsed.  Note the _cmlst field, which implies that FDB's can be
linked.  This is used to inform the parser that a number of lexical
items are valid for this specific position in the input stream.  For
example, suppose we have a number of lexemes which map into the same
internal morpheme.  A user program might want to be able to parse a
number when asking for telephonic information but not preclude the
possibility of someone typing in the number as a word (in which case,
we'd want to think about parsing keywords).
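
The idea of linked alternatives can be sketched without the parser
itself.  In the toy below, the alt structure stands in for an FDB, its
next field plays the role of _cmlst, and first_match() returns the
alternative that accepted the input.  Every name here is hypothetical,
and the real parser does considerably more (help, defaults,
completion); this only shows the chain walk.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

enum { ALT_NUMBER, ALT_KEYWORD };       /* hypothetical function codes */

typedef struct ALT {
    int         fnc;            /* what kind of field to try */
    struct ALT *next;           /* link to alternate, like _cmlst */
    const char *word;           /* keyword accepted when fnc is ALT_KEYWORD */
} alt;

/* Does this single alternative accept the text? */
static int accepts(const alt *a, const char *text)
{
    size_t i;

    if (a->fnc == ALT_KEYWORD)
        return strcmp(a->word, text) == 0;
    if (text[0] == '\0')
        return 0;                       /* empty string is not a number */
    for (i = 0; text[i] != '\0'; i++)
        if (!isdigit((unsigned char)text[i]))
            return 0;
    return 1;
}

/* Walk the chain; return the first alternative that accepts, or NULL. */
alt *first_match(alt *chain, const char *text)
{
    alt *a;

    for (a = chain; a != NULL; a = a->next)
        if (accepts(a, text))
            return a;
    return NULL;
}
```

Given a number alternative linked to a keyword alternative for "five",
the text "5551212" is claimed by the first and "five" by the second.
The caller learns which one matched from the returned pointer.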

/*
 * FDB structures hold information required to parse specific fields of
 * a command line.
 */

typedef struct FDB {
	int	_cmfnc;		/* Function code for this field */
	int	_cmffl;		/* Function specific parse flags */
	struct FDB * _cmlst;	/* Link to alternate FDB */
	pdat	_cmdat;		/* Function specific parsing data */
	char *	_cmhlp;		/* pointer to help string */
	char *	_cmdef;		/* pointer to default string */
	brktab * _cmbrk;	/* pointer to special break table */
} fdb;

/* Common flag defined for all parse functions */
#define	CM_SDH	0x8000		/* Suppress default help message */

Here is an example of an fdb that is used to instruct the parser to
parse a keyword.  _CMKEY says that we should parse a keyword.
The address of the previously described keyword table is coerced into
a pdat (a union of possible parse specific data).  The break table
will be described later.

  static fdb cmdfdb = { _CMKEY, 0, NULL, (pdat) &cmdtab, "Command, ", 
			  NULL, &keybrk };


Here is an example of an fdb pointer that is used to receive data from
the parse.

fdb *used;

After the parse, we might inspect the used->_cmdat field to see which
token was recognized.

When parsing, it is useful to change the characteristics of what
characters the parser considers as lexical stops.  For example, a
lexeme (a word) is surrounded by lexical stop characters which are
usually spaces but may be other grammatical symbols such as period and
comma.  The parser knows to break lexemes based on these lexical
stops.  Thus, stop.now would yield two keywords "stop" and "now" (plus
a token ".").  However, we may need to change these notions in order
to parse what seem to be the multiple lexical items "stop.now" as one
lexeme.

Break tables provide us this functionality by allowing us to tell the
parser which graphic characters are lexical stops and which are not.  A BRKTAB
structure is a pair of 128-bit arrays specifying the break
characteristics of the ASCII characters.  The _br1st array specifies
characters which will break field input when they are typed as the
first character of a field.  The _brrest array specifies characters
that break in other than the first position.

Each array contains one bit per ASCII code, ordered according to the
ASCII collating sequence.  The leftmost (most significant) bit of the
first byte corresponds to ASCII code 0, and the rightmost bit of that
same byte corresponds to ASCII code 7.  The leftmost bit of the second
byte is for ASCII code 8, and so on.  When a bit is on, the
corresponding character will act as a break character, otherwise it
will not.
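
The bit ordering just described translates into a short test routine.
This is a sketch only: is_break() and the sample array are made-up
names, the arrays are declared unsigned char here to keep the shifts
well defined, and treating codes outside 0 to 127 as breaks is an
assumption.

```c
/*
 * Return 1 if ASCII code c is marked as a break in a 128-bit array.
 * Byte c / 8 holds the bit; within the byte, the most significant
 * bit is for the lowest code.
 */
int is_break(const unsigned char bits[16], int c)
{
    if (c < 0 || c > 127)
        return 1;               /* assumption: non-ASCII always breaks */
    return (bits[c >> 3] >> (7 - (c & 7))) & 1;
}

/*
 * Sample array: 0x81 in byte 0 marks codes 0 and 7, and 0x80 in
 * byte 4 marks code 32 (space).  Everything else is a non-break.
 */
const unsigned char sample[16] = { 0x81, 0x00, 0x00, 0x00, 0x80 };
```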

typedef struct BRKTAB {
	char _br1st[16];	/* Bit array for initial character breaks */
	char _brrest[16];	/* Bit array for subsequent breaks */
} brktab;

Here is an example of a break table from the cop program.  It is set
to prevent the "!" character from being considered as a lexical stop
so that we can parse the "!sh" keyword as a single lexical item.
Somebody ought to ask Lowry why we need two break tables...

  static brktab keybrk = {		/* standard break table */
    {					/* 1st char break array */
					/* all but letters, digits, hyphen, '!' */
      0xff, 0xff, 0xff, 0xff, 0xbf, 0xfb, 0x00, 0x3f,
      0x80, 0x00, 0x00, 0x1f, 0x80, 0x00, 0x00, 0x1f
    },
    {					/* subsequent char break array */
					/* same as above */
      0xff, 0xff, 0xff, 0xff, 0xbf, 0xfb, 0x00, 0x3f, 
      0x80, 0x00, 0x00, 0x1f, 0x80, 0x00, 0x00, 0x1f
    }

  };
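
The hex constants above need not be worked out by hand.  As a sketch,
the routines below build the same sixteen bytes from the rule "every
ASCII code breaks except letters, digits and hyphen" and then clear
the bit for "!".  The names set_break() and build_keybrk() are
hypothetical, and unsigned char is used in place of the char arrays
in BRKTAB to keep the bit operations portable.

```c
#include <ctype.h>
#include <string.h>

/* Turn the break bit for ASCII code c (0 to 127) on or off. */
void set_break(unsigned char bits[16], int c, int on)
{
    unsigned char mask = (unsigned char)(0x80 >> (c & 7));

    if (on)
        bits[c >> 3] |= mask;
    else
        bits[c >> 3] &= (unsigned char)~mask;
}

/* Build the keyword break array used by cop. */
void build_keybrk(unsigned char bits[16])
{
    int c;

    memset(bits, 0, 16);
    for (c = 0; c < 128; c++)           /* the usual keyword rule */
        set_break(bits, c, !(isalnum(c) || c == '-'));
    set_break(bits, '!', 0);            /* let "!sh" parse as one keyword */
}
```

Clearing the "!" bit is what changes the fifth byte from 0xff to 0xbf:
code 33 is the second bit from the left in the byte covering codes 32
through 39.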

Having described some structures in the command parsing package, we
can now examine a few elementary routines which use them.

cmbufs();

The cmbufs() routine is used to set pointers and counters in the
command state block to the various working buffers.  Assuming the
previous descriptions of these buffers, here is a typical call:

  cmbufs(cmdbuf,BUFSIZ,atmbuf,BUFSIZ,wrkbuf,BUFSIZ); /* init ccmd */

That is, we set the address and size of the command buffer, the atom
buffer and the working buffer.  This must be done before any call to
the parser, otherwise we will have no areas to work in.  Cmbufs always
returns true.

cmseti();

The cmseti() routine is used to tell the parser which channels to do
input and output on.  If any of these is a terminal, cmseti() will
also properly condition the terminal for command parsing.  Among other
things, this means setting cbreak mode and no echo.  Here is a sample
call:

  cmseti(stdin, stdout, stderr);

Obviously, this is the usual case of reading input from the user and
writing it on the terminal with errors going to standard error.
Another, perhaps more interesting, application would be to have
standard error going to a pipe.  A child process might interpret and
log the error message and then print it on stderr.  The point is that
it makes no difference what these channels are.  Cmseti() returns an
error if it can't condition the terminal.  It must be issued before
parsing begins.

cmdone();

The cmdone() routine is used to terminate parsing.  It resets the
terminal characteristics for any channels that were conditioned with
cmseti().  It is different from the cmtend() and cmtset() pair in that
cmdone() is indicative of a complete end of program.  It is usually
soon followed by an exit() or a return() from main().

Having properly conditioned the terminal and defined various working
buffers, we are now ready to give the parser information as to how to
parse and what kind of information to present to the user.  Here is a
very typical example of a main parse loop taken from cop.

    cmseter();				/* errors come back here */
    if (cmcsb._cmerr == CMxEOF) {	/* exit on EOF */
      done = TRUE;
      continue;
    }
    prompt("opr>");			/* prompt */
    cmsetrp();				/* reparse comes here */
    parse(&cmdfdb,&parseval,&used);


cmseter();

The cmseter() routine is used to set the error return address in the
command state block.  If the parser is completely unable to parse a
given input stream, it may decide to clear that input stream and start
over.  To do this, it needs to know where to jump in the user program
to reinitiate parsing.  Cmseter() should be called before the prompt()
routine so that a parse error will cause a new prompt to be written
after the error message is typed.  Cmseter() does not return an error.

prompt();

The prompt() routine is used to issue the current program prompt to
the user.  This is usually some string which is either the name of the
program or evocative of the current function being performed.  Thus
opr> means that the program wants to parse operation commands.

cmsetrp();

The cmsetrp() routine is used to set the current reparse address.
This is as distinct from the parse error address.  A parse error
indicates an unrecognizable input stream which must be cleared and
resynchronized.  A reparse is caused by the deterministic nature of
the command parser.  For example, a user might be typing a series of
tokens and then change his mind and back up and issue a different
sequence.  While the program can still parse all this, it must
reinitiate parsing to properly retrack the user input.  Thus, we
notice that the call to cmsetrp() is given *after* the call to
prompt().  This sets the reparse jump address in the command state
block to point to after the prompt issuance so that a reparse does not
cause us to retype the prompt and possibly confuse the user.
Cmsetrp() returns no error.
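
The difference between the two return addresses can be demonstrated
with the standard setjmp() and longjmp() routines.  The sketch below
assumes that the jump addresses behave like jmp_buf contexts; that is
an assumption about the mechanism, not a statement of how ccmd does
it.  The fake parser and the counters are invented for the
demonstration.

```c
#include <setjmp.h>

static jmp_buf err_jmp;         /* stand-in for the error return address */
static jmp_buf rp_jmp;          /* stand-in for the reparse address */

int prompts_shown;              /* how many times we "prompted" */
int parse_calls;                /* how many times the fake parser ran */

/*
 * Fake parser: the first call asks for a reparse, the second reports
 * a parse error and the third succeeds.
 */
static void fake_parse(void)
{
    parse_calls++;
    if (parse_calls == 1)
        longjmp(rp_jmp, 1);     /* reparse: resume after the prompt */
    if (parse_calls == 2)
        longjmp(err_jmp, 1);    /* error: resume before the prompt */
}

/* One pass through a command loop body shaped like the one in cop. */
void command_once(void)
{
    prompts_shown = 0;
    parse_calls = 0;

    setjmp(err_jmp);            /* like cmseter(): errors come back here */
    prompts_shown++;            /* like prompt("opr>") */
    setjmp(rp_jmp);             /* like cmsetrp(): reparses come back here */
    fake_parse();
}
```

After command_once() returns, the fake parser has run three times but
the prompt has been issued only twice: once at the start and once
after the error.  The reparse did not cause a new prompt, which is
exactly the ordering argument made above.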

parse();

As might be guessed, the parse() routine is what actually invokes the
ccmd parser proper.  It is passed three arguments.  The first is the
address of a function descriptor block which describes, possibly
through the use of linked FDB structures, valid tokens for this
position in the input stream.  The second is the address of a pval
union.  This is used to return information specific to the parse.
For a number, this might be an int for the actual number parsed.  For
a keyword, it would be the contents of the _kwval field for the
selected keyword in the keyword table.  The third argument is the
address of an fdb pointer (used, above), through which the parser
reports the fdb selected for that particular parse when using linked
fdbs.  To parse a keyword from the cop main command level, given the
previous definitions, we simply:

    parse(&cmdfdb,&parseval,&used);