24-Apr-87 13:34:25-EST,16794;000000000001
Return-Path:
Received: from cunixc.columbia.edu by CU20B.COLUMBIA.EDU with TCP; Fri 24 Apr 87 13:34:08-EST
Received: by cunixc.columbia.edu (5.54/5.10)
	id AA01984; Fri, 24 Apr 87 13:32:35 EST
Date: Fri, 24 Apr 87 13:32:35 EST
From: Thomas De Bellis
To: fdc@cunixc.columbia.edu
Subject: initial ccmd tutorial

cunixc:/u1/sy/sluggo/opser/cmdoc.txt, 27-Mar-1987 15:09:34 by sluggo

The following documents some aspects of the command package.  I make
no attempt to follow any particular order beyond this: the list was
made by looking over the ~sluggo/opser/cop.c program and explaining,
in order of usage, every ccmd function or structure used.

The cop program is neither the canonical nor the only example of the
correct way to parse things.  Other programs, such as ccmdtest,
provide more examples and bear examination.

keywords:

A keyword table is an unordered list of possible keywords.  For the
purposes of this discussion, a keyword can be considered to be a
lexical item with more than one possible morpheme, depending on parse
context.  Each entry contains the actual lexeme, some parse control
information and (optional) user information.  The KEYWRD structure
describes each keyword and the KEYTAB structure describes the table
as a whole.

/*
 * KEYWRD structure specifies one entry in a keyword table.  KEYTAB
 * structure describes a table of keywords.
 */

typedef struct KEYWRD {
    char *  _kwkwd;     /* keyword string */
    short   _kwflg;     /* flags (see below) */
    int     _kwval;     /* arbitrary value, not used internally */
                        /*  except for abbreviations... see KEY_ABR */
                        /*  flag below */
} keywrd;

/* Flags that can be present in a keyword entry (in the _kwflg field) */

#define KEY_ABR 0x0001  /* keyword is an abbreviation for the */
                        /*  keyword indexed by this entry's _kwval */
                        /*  value */
#define KEY_NOR 0x0002  /* Ignore this keyword (do not recognize */
                        /*  any prefix, or even an exact match) */
#define KEY_INV 0x0004  /* Invisible keyword (not shown with help) */
#define KEY_MAT 0x0008  /* This keyword matches current input (used */
                        /*  internally) */

Here is an example of a keyword table from the cop program.  Note that
the keyword table is an array of KEYWRD structures and, unlike the
TBLUK% tables used by other parsers such as the DEC20 COMND% parser
and the Pascal interface, it does not have to be in alphabetical
order.

/* The following is the main keyword table.  Each entry consists of
 * the keyword name, some flags for the parser and an internal token
 * which is used for efficient programmed dispatch.  Note that this
 * table is in alphabetical order for historical reasons.  The
 * numerical order of the tokens, however, is irrelevant since it
 * references itself.  See cop.h for further trivia.
 */

keywrd cmds[] = {                   /* Begin toplevel keyword table */
  { "b",  KEY_ABR|KEY_INV, 1},      /* b to backup (1 is index in this table) */
  { "backup",   0, BACKUCMD },      /* Start some kind of backup script */
  { "bye",      KEY_INV,EXITCMD },  /* Synonym for exit */
  { "exit",     0, EXITCMD },       /* Get out of Cop */
  { "fsck",     0, FSCKCMD },       /* Check the file system */
  { "halt",     0, HALTCMD },       /* Halt the CPU */
  { "help",     0, HELPCMD },       /* Help the user */
  { "quit",     KEY_INV,EXITCMD },  /* Synonym for exit */
  { "restart",  0, RESTACMD },      /* Turn time sharing on if off */
  { "shutdown", 0, SHUTDCMD },      /* Shut down the computer */
  { "take",     0, TAKECMD },       /* Take commands from a file */
  { "users",    0, USERSCMD },      /* Show users on system */
  { "wall",     0, WALLCMD },       /* Write a message to all users */
  { "!sh", KEY_NOR|KEY_INV, SHELLCMD } /* Spawn a shell */
};                                  /* End toplevel keyword table */

Note that the _kwval field for each keyword is filled in with a user
specified token.  It is the programmer's responsibility to interpret
this number in all cases except when the KEY_ABR flag is set in the
_kwflg field.  In that case, the number in the _kwval field is used as
an index into the array of structures.  In other words, "b" is an
abbreviation for "backup".  The 1 in the _kwval field will never be
seen by the programmer in the course of an ordinary parse; it refers
to the index of the "backup" keyword in the table (i.e., position 1,
counting from zero).

In order to parse keywords, the ccmd package must have some additional
control information.  This is supplied by the KEYTAB structure, which
is made up of two parts.  The _ktcnt field supplies the parser with a
count of the keywords (that is, KEYWRD structures) in the table and
the _ktwds field supplies a pointer to the keyword table in question.

typedef struct KEYTAB {
    int      _ktcnt;    /* number of keywords in table */
    keywrd * _ktwds;    /* array of keyword entries */
} keytab;

Here is an example from the cop program.

/* The following is the command table structure which is used to keep
 * track of our main keyword table.
 */

keytab cmdtab = { (sizeof(cmds)/sizeof(keywrd)), cmds };

Note the nearly obvious trick used above to count the number of
keywords in the keyword table: we take the size of the entire table
and divide by the size of an individual entry.

Other structures are used in the course of a parse.  Various buffers
are used to accumulate data; these are the command buffer, the atom
buffer and the working buffer.  The command buffer contains a complete
typescript of the final user input (without editing characters).  The
atom buffer contains a copy of the last recognized input token, for
example the last parsed user keyword.  The working buffer is used
internally by the parser to shuttle characters around.  Here are some
sample buffer declarations.

int cmdbuf[BUFSIZ];     /* The command buffer */
char atmbuf[BUFSIZ];    /* The atom buffer */
char wrkbuf[BUFSIZ];    /* The working buffer */

Note that the command buffer is *not* declared as a character array!
This is because other portions of each int slot are used to store
flags about each character, for example whether the character is to
be echoed.  The command buffer might have been more clearly defined as
a struct made up of a char and a short for flags, but that's life, I
guess.  Normally, a user program only references the atom buffer, to
inspect and possibly copy input tokens.

Another structure that is used is the pval structure.  It is the union
of all the possible types of tokens that a parse can return.  The user
program can use information in the pval structure to recognize which
keyword was used out of a list of keywords, for example.
Here is the current union declaration:

/* Union declaration for parse return values */

typedef union PVAL {
    int     _pvint;
    float   _pvflt;
    char    _pvchr;
    char *  _pvstr;
    char ** _pvstrvec;
    datime  _pvtad;
    pvfil   _pvfil;
    struct passwd ** _pvusr;
    struct group **  _pvgrp;
    char *  _pvpara;
} pval;

Here is a sample declaration from cop:

pval parseval;

Note how this might be used after a parse for efficient programmed
dispatch.  We parsed a keyword and know that its _kwval field will now
be found in the parseval union.  Thus, we can use a switch statement
and have a case for every keyword.  That is, we can have the user data
in the KEYWRD structure index us directly to the associated semantic
action routine for that keyword.

switch(parseval._pvint) {
case USERSCMD:          /* Show users on system */
    users();
    break;
    /* ... one case per keyword token ... */
}

The FDB structure can be used either to describe specific fields of a
parse (whence the name FDB, for function descriptor block) or to
return information to the user program as to what was actually parsed.
Note the _cmlst field, which implies that FDBs can be linked.  This is
used to inform the parser that any of a number of lexical items is
valid at this specific position in the input stream.  For example,
suppose we have a number of lexemes which map into the same internal
morpheme.  A user program might want to be able to parse a number when
asking for telephonic information but not preclude the possibility of
someone typing in the number as a word (in which case, we'd want to
think about parsing keywords).

/*
 * FDB structures hold information required to parse specific fields of
 * a command line.
 */

typedef struct FDB {
    int     _cmfnc;         /* Function code for this field */
    int     _cmffl;         /* Function specific parse flags */
    struct FDB * _cmlst;    /* Link to alternate FDB */
    pdat    _cmdat;         /* Function specific parsing data */
    char *  _cmhlp;         /* pointer to help string */
    char *  _cmdef;         /* pointer to default string */
    brktab * _cmbrk;        /* pointer to special break table */
} fdb;

/* Common flag defined for all parse functions */

#define CM_SDH 0x8000       /* Suppress default help message */

Here is an example of an fdb that is used to instruct the parser to
parse a keyword.  _CMKEY says that we should parse a keyword.  The
address of the previously described keyword table is coerced into a
pdat (a union of possible parse specific data).  The break table will
be described later.

static fdb cmdfdb = { _CMKEY, 0, NULL, (pdat) &cmdtab, "Command, ",
                      NULL, &keybrk };

Here is an example of an fdb that is used to receive data from the
parse.

fdb *used;

After the parse, we might inspect the used->_cmdat field to see which
token was recognized.

When parsing, it is useful to change which characters the parser
considers to be lexical stops.  For example, a lexeme (a word) is
surrounded by lexical stop characters, which are usually spaces but
may be other grammatical symbols such as period and comma.  The parser
knows to break lexemes based on these lexical stops.  Thus, stop.now
would yield two keywords "stop" and "now" (plus a token ".").
However, we may need to change these notions in order to parse what
seem to be the multiple lexical items "stop.now" as one lexeme.  Break
tables provide this functionality by allowing us to tell the parser
which graphic characters are lexical stops and which are not.

A BRKTAB structure is a pair of 128-bit arrays specifying the break
characteristics of the ASCII characters.  The _br1st array specifies
characters which will break field input when they are typed as the
first character of a field.
The _brrest array specifies characters that break in other than the
first position.  Each array contains one bit per ASCII code, ordered
according to the ASCII collating sequence.  The leftmost (most
significant) bit of the first byte corresponds to ASCII code 0, and
the rightmost bit of that same byte corresponds to ASCII code 7.  The
leftmost bit of the second byte is for ASCII code 8, and so on.  When
a bit is on, the corresponding character will act as a break
character; otherwise it will not.

typedef struct BRKTAB {
    char _br1st[16];    /* Bit array for initial character breaks */
    char _brrest[16];   /* Bit array for subsequent breaks */
} brktab;

Here is an example of a break table from the cop program.  It is set
to prevent the "!" character from being considered a lexical stop so
that we can parse the "!sh" keyword as a single lexical item.
Somebody ought to ask Lowry why we need two break tables...

static brktab keybrk = {        /* standard break table */
  {                             /* 1st char break array */
                                /* all but letters, digits, hyphen, "!" */
    0xff, 0xff, 0xff, 0xff, 0xbf, 0xfb, 0x00, 0x3f,
    0x80, 0x00, 0x00, 0x1f, 0x80, 0x00, 0x00, 0x1f
  },
  {                             /* subsequent char break array */
                                /* same as above */
    0xff, 0xff, 0xff, 0xff, 0xbf, 0xfb, 0x00, 0x3f,
    0x80, 0x00, 0x00, 0x1f, 0x80, 0x00, 0x00, 0x1f
  }
};

Having described some structures in the command parsing package, we
can now examine a few elementary routines which use them.

cmbufs();

The cmbufs() routine is used to set pointers and counters in the
command state block to the various working buffers.  Given the buffer
declarations above, here is a typical call:

cmbufs(cmdbuf,BUFSIZ,atmbuf,BUFSIZ,wrkbuf,BUFSIZ); /* init ccmd */

That is, we set the address and size of the command buffer, the atom
buffer and the working buffer.  This must be done before any call to
the parser; otherwise we will have no areas to work in.  Cmbufs always
returns true.
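
The bit layout of the break arrays described above (ASCII code 0 in
the most significant bit of byte 0) can be sketched with a few small
helpers.  This is only an illustration, not part of ccmd: the names
brk_test, brk_set, brk_clear and keybrk_first are made up for this
note, and unsigned char is used instead of char so that values like
0xff do not depend on char signedness.  The data is the first-character
array of the keybrk table shown earlier.

```c
/* Hypothetical helpers (not ccmd routines) for the 128-bit break arrays. */

/* Copy of the 1st-character array from the keybrk example above. */
const unsigned char keybrk_first[16] = {
    0xff, 0xff, 0xff, 0xff, 0xbf, 0xfb, 0x00, 0x3f,
    0x80, 0x00, 0x00, 0x1f, 0x80, 0x00, 0x00, 0x1f
};

/* Return 1 if ASCII code c is marked as a break character, else 0.
 * Byte c/8 holds the bit; the leftmost bit is the lowest code. */
int brk_test(const unsigned char bits[16], int c)
{
    if (c < 0 || c > 127)
        return 1;           /* assumption: anything non-ASCII breaks */
    return (bits[c / 8] & (0x80 >> (c % 8))) != 0;
}

/* Mark ASCII code c as a break character. */
void brk_set(unsigned char bits[16], int c)
{
    if (c >= 0 && c <= 127)
        bits[c / 8] |= (unsigned char)(0x80 >> (c % 8));
}

/* Mark ASCII code c as a non-break character. */
void brk_clear(unsigned char bits[16], int c)
{
    if (c >= 0 && c <= 127)
        bits[c / 8] &= (unsigned char)~(0x80 >> (c % 8));
}
```

Reading the table this way shows where the 0xbf in the fifth byte
comes from: "!" is ASCII 33, which is the second bit of byte 4, and
that is the one bit turned off there.  Clearing the bit for "." in
both arrays with brk_clear would, by the same logic, be one way to let
"stop.now" through as a single lexeme.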

cmseti();

The cmseti() routine is used to tell the parser which channels to do
input and output on.  If any of these is a terminal, cmseti() will
also properly condition the terminal for command parsing.  Among other
things, this means setting cbreak mode and no echo.  Here is a sample
call:

cmseti(stdin, stdout, stderr);

This is the usual case of reading input from the user and writing it
on the terminal, with errors going to standard error.  Another,
perhaps more interesting, application would be to have standard error
going to a pipe.  A child process might interpret and log the error
message and then print it on stderr.  The point is that it makes no
difference what these channels are.  Cmseti() returns an error if it
can't condition the terminal.  It must be issued before parsing
begins.

cmdone();

The cmdone() routine is used to terminate parsing.  It resets the
terminal characteristics for any channels that were conditioned with
cmseti().  It differs from the cmtend() and cmtset() pair in that
cmdone() signals a complete end of program.  It is usually soon
followed by an exit() or a return from main().

Having properly conditioned the terminal and defined various working
buffers, we are now ready to give the parser information as to how to
parse and what kind of information to present to the user.  Here is a
very typical example of a main parse loop taken from cop.

cmseter();                      /* errors come back here */
if (cmcsb._cmerr == CMxEOF) {   /* exit on EOF */
    done = TRUE;
    continue;
}
prompt("opr>");                 /* prompt */
cmsetrp();                      /* reparse comes here */
parse(&cmdfdb,&parseval,&used);

cmseter();

The cmseter() routine is used to set the error return address in the
command state block.  If the parser is completely unable to parse a
given input stream, it may decide to clear that input stream and start
over.  To do this, it needs to know where to jump in the user program
to reinitiate parsing.
Cmseter() should be called before the prompt() routine so that a parse
error will cause a new prompt to be written after the error message is
typed.  Cmseter() does not return an error.

prompt();

The prompt() routine is used to issue the current program prompt to
the user.  This is usually some string which is either the name of the
program or evocative of the current function being performed.  Thus
opr> means that the program wants to parse operation commands.

cmsetrp();

The cmsetrp() routine is used to set the current reparse address.
This is distinct from the parse error address.  A parse error
indicates an unrecognizable input stream which must be cleared and
resynchronized.  A reparse is caused by the deterministic nature of
the command parser.  For example, a user might be typing a series of
tokens and then change his mind, back up and issue a different
sequence.  While the program can still parse all this, it must
reinitiate parsing to properly retrack the user input.  This is why
the call to cmsetrp() is placed *after* the call to prompt().  It sets
the reparse jump address in the command state block to point to after
the prompt issuance, so that a reparse does not cause us to retype the
prompt and possibly confuse the user.  Cmsetrp() returns no error.

parse();

As might be guessed, the parse() routine is what actually invokes the
ccmd parser proper.  It is passed three arguments.  The first is the
address of a function descriptor block which describes, possibly
through the use of linked FDB structures, the valid tokens for this
position in the input stream.  The second is the address of a pval
structure.  This is used to return information specific to the parse.
For a number, this might be an int for the actual number parsed.  For
a keyword, it would be the contents of the _kwval field for the
selected keyword in the keyword table.
The third argument is the address of an fdb pointer (such as used,
declared earlier), which receives the fdb selected for that particular
parse when using linked fdbs.

To parse a keyword from the cop main command level, given the previous
definitions, we simply call:

parse(&cmdfdb,&parseval,&used);
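
To make the keyword semantics described earlier more concrete, here is
a rough sketch of how a table with KEY_ABR and KEY_NOR entries could be
resolved to a _kwval and then dispatched.  This is not the ccmd
implementation.  The structure is a local copy of the KEYWRD layout
shown above, the token values stand in for whatever cop.h defines, and
kw_lookup and action_name are hypothetical names.  The matching rules
(exact match wins, otherwise a unique prefix, case-sensitive) are an
assumption made for the sketch.

```c
#include <string.h>

#define KEY_ABR 0x0001      /* _kwval is an index into the table */
#define KEY_NOR 0x0002      /* never recognize this entry */
#define KEY_INV 0x0004      /* not shown in help */

/* Local copy of the keyword entry layout described in this note. */
typedef struct { const char *_kwkwd; short _kwflg; int _kwval; } keywrd;

/* Placeholder token values; the real ones live in cop.h. */
enum { BACKUCMD = 1, EXITCMD, HALTCMD, HELPCMD };

const keywrd demo_cmds[] = {
    { "b",      KEY_ABR|KEY_INV, 1 },   /* 1 is the index of "backup" */
    { "backup", 0,       BACKUCMD },
    { "bye",    KEY_INV, EXITCMD },
    { "exit",   0,       EXITCMD },
    { "halt",   0,       HALTCMD },
    { "help",   0,       HELPCMD },
};
const int demo_count = (int)(sizeof(demo_cmds) / sizeof(demo_cmds[0]));

/* Return the _kwval for word, or -1 if it is unknown or ambiguous. */
int kw_lookup(const keywrd *tab, int n, const char *word)
{
    size_t len = strlen(word);
    int hit = -1, nhits = 0, i;

    if (len == 0)
        return -1;
    for (i = 0; i < n; i++) {
        if (tab[i]._kwflg & KEY_NOR)
            continue;                       /* entry is switched off */
        if (strncmp(tab[i]._kwkwd, word, len) != 0)
            continue;                       /* word is not a prefix */
        hit = i;
        if (tab[i]._kwkwd[len] == '\0') {   /* exact match wins */
            nhits = 1;
            break;
        }
        nhits++;
    }
    if (nhits != 1)
        return -1;                          /* no match or ambiguous */
    if (tab[hit]._kwflg & KEY_ABR) {        /* follow the abbreviation */
        hit = tab[hit]._kwval;
        if (hit < 0 || hit >= n)
            return -1;
    }
    return tab[hit]._kwval;
}

/* Dispatch on the token, in the style of the switch shown earlier. */
const char *action_name(int token)
{
    switch (token) {
    case BACKUCMD: return "backup";
    case EXITCMD:  return "exit";
    case HALTCMD:  return "halt";
    case HELPCMD:  return "help";
    default:       return "unknown";
    }
}
```

With this table, "b" resolves through index 1 to the token for
"backup", "bye" and "exit" share a token, and "h" is rejected because
it is a prefix of both "halt" and "help".  The caller never sees the 1
stored in the abbreviation entry, which matches the point made in the
keyword section.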