The Kermit Project |
Now hosted by
Panix.com
New York City USA •
kermit@kermitproject.org
| ||||||||
|
Frank da Cruz
The Kermit Project
Last update: Thu Mar 23 11:00:30 2017
This page assumes familiarity with Kermit scripting and rudiments of programming, as presented in Using C-Kermit, Second Edition: Arrays, loops, and so on. And with the file input/output features of C-Kermit 7.0 and later, documented HERE.A Tab-Separated Value (TSV) file is composed of lines of text. Each line is a record composed of fields, and the fields are separated by the ASCII Horizontal Tab (HT) character, code value decimal 9. The text from the beginning of a line to the first Tab is the first field; the text between the first and second Tab is the second field and so on. Leading and tailing spaces are considered part of the field. There are no special characters except for Tab itself. TSV format is commonly used for exporting tables from spreadsheets, databases, and Web pages.
A Comma-Separated Value (CSV) file is like a TSV file, except that (a) fields are separated by comma rather than Tab, and (b) there are all kinds of rules for quoting and special characters and leading and trailing spaces:
Here is an example:
The first two are regular fields. The second is a field that has an embedded space but in which any leading or trailing spaces are to be ignored. The fourth is an empty field, but still a field. The fifth is a field that contains embedded commas. The sixth has leading and trailing spaces. The last field has embedded quotation marks.aaa, bbb, has spaces,,"ddd,eee,fff", " has spaces ","Muhammad ""The Greatest"" Ali"
The \fsplit() function in C-Kermit 9.0 handles both TSV and CSV formats. In earlier releases \fsplit() could not handle CSV at all, because there was no provision for a separator to be surrounded by whitespace that was to be considered part of the separator nor for quoting doublequotes inside of a quoted string..
The "TSV" argument means that a field includes all characters except Tab. A break character of Tab (\9) is assumed, but you can specify a different character or break set if you wish, but the simple rules for TSV files will still apply if you put "TSV" as the fourth agument.fopen /read \%c somefile.tsv if fail exit .n = 0 while true { fread /line \%c record if fail break incr n void \fsplit(\m(record),&a,,TSV) echo echo RECORD \m(n): show array a } fclose \%c exit
void \fsplit(\m(record),&a,\44,TSV)That is, treat it as a TSV file but specify comma (\44) as the field separator. Experimentation with spreadsheets such as LibreOffice Calc shows that they attempt to “autorecognize” the syntax of a CSV file. But that doesn't mean you can intermix the two different styles in one CSV file.
CSV files are handled the same way, but in this case you put "CSV" as the 4th parameter to \fsplit(). Here's an illustration:
which gives the following results:def xx { echo [\fcontents(\%1)] .\%9 := \fsplit(\fcontents(\%1),&a,,CSV) for \%i 1 \%9 1 { echo "\flpad(\%i,3). [\&a[\%i]]" } echo "-----------" } xx {a,b,c} xx { a , b , c } xx { aaa,,ccc," with spaces ",zzz } xx { "1","2","3","","5" } xx { this is a single field } xx { this is one field, " and this is another " } xx { name,"Mohammad ""The Greatest"" Ali", age, 67 } xx { """field enclosed in doublequotes""" } exit
The break character (3rd parameter) can be omitted if the 4th parameter is "CSV", in which case comma (\44) is used. When “CSV” is specified as the include set:[a,b,c] 1. [a] 2. [b] 3. [c] ----------- [ a , b , c ] 1. [a] 2. [b] 3. [c] ----------- [ aaa,,ccc," with spaces ",zzz ] 1. [aaa] 2. [] 3. [ccc] 4. [ with spaces ] 5. [zzz] ----------- [ "1","2","3","","5" ] 1. [1] 2. [2] 3. [3] 4. [] 5. [5] ----------- [ this is a single field ] 1. [this is a single field] ----------- [ this is one field, " and this is another " ] 1. [this is one field] 2. [ and this is another ] ----------- [ name,"Mohammad ""The Greatest"" Ali", age, 67 ] 1. [name] 2. [Mohammad "The Greatest" Ali] 3. [age] 4. [67] ----------- [ """field enclosed in doublequotes""" ] 1. ["field enclosed in doublequotes"] -----------
Of course you can specify any separator(s) you want with either the CSV, TSV, or ALL symbolic include sets. For example, if you have a TSV file in which you want the spaces around each Tab to be discarded, you can use:
\9 is Tab.\fsplit(variable, &a, \9, ALL)
The new symbolic include sets can also be used with \fword(), which is just like \fsplit() except that it retrieves the nth word from the argument string, rather than making an array of all the words. In C-Kermit you can get information about these or any other functions with the HELP FUNCTION command, e.g.:
C-Kermit> help func word Function \fword(s1,n1,s2,s3,n2,n3) - Extracts a word from a string. s1 = source string. n1 = word number (1-based) counting from left; if negative, from right. s2 = optional break set. s3 = optional include set (or ALL, CSV, or TSV). n2 = optional grouping mask. n3 = optional separator flag: 0 = collapse adjacent separators; 1 = don't collapse adjacent separators. \fword() returns the n1th "word" of the string s1, according to the criteria specified by the other parameters. The BREAK SET is the set of all characters that separate words. The default break set is all characters except ASCII letters and digits. ASCII (C0) control characters are treated as break characters by default, as are spacing and punctuation characters, brackets, and so on, and all 8-bit characters. The INCLUDE SET is the set of characters that are to be treated as parts of words even though they normally would be separators. The default include set is empty. Three special symbolic include sets are also allowed: ALL (meaning include all bytes that are not in the break set) CSV (special treatment for Comma-Separated-Value records) TSV (special treatment for Tab-Separated-Value records) For operating on 8-bit character sets, the include set should be ALL. If the GROUPING MASK is given and is nonzero, words can be grouped by quotes or brackets selected by the sum of the following: 1 = doublequotes: "a b c" 2 = braces: {a b c} 4 = apostrophes: 'a b c' 8 = parentheses: (a b c) 16 = square brackets: [a b c] 32 = angle brackets: <a b c> Nesting is possible with {}()[]<> but not with quotes or apostrophes. Returns string: Word number n1, if there is one, otherwise an empty string. Also see: HELP FUNCTION SPLIT C-Kermit>
Previous calling conventions for \fjoin() are undisturbed, including the ability to specify a portion of an array, rather than the whole array:
declare \&a[] = 1 2 3 4 5 6 7 8 9 echo \fjoin(&a[3:7],CSV) 3,4,5,6,7
Using \fsplit() and \fjoin() it is now possible to convert a comma-separated value list into a tab-separated value list, and vice versa (which is not a simple matter of changing commas to tabs or vice versa).
Here is a simple example in which we purge all records of customers who have two or more unpaid bills. The file is sorted so that each license purchase record is followed by its annual maintenance payment records in chronological order.
#!/usr/local/bin/kermit .filename = somefile.csv # Input file in CSV format fopen /read \%c \m(filename) # Open it if fail exit # Don't go on if open failed copy \m(filename) ./new # Make a copy of the file .oldserial = 00000000000 # Multiple records for each serial number .zeros = 0 # Unpaid bill counter while true { # Loop fread /line \%c line # Get a record if fail exit # End of file .n := \fsplit(\m(line),&a,\44,CSV) # Split the fields into an array if not equ "\m(oldserial)" "\&a[6]" { # Have new serial number? # Remove all records for previous serial number # if two or more bills were not paid... if > \m(zeros) 1 { grep /nomatch \m(oldserial) /output:./new2 ./new rename ./new2 ./new } .oldserial := \&a[6] # To detect next time serial number changes .zeros = 0 # Reset unpaid bill counter } if equ "\&a[5]" "$0.00" { # Element 5 is amount paid increment zeros # If it's zero, count it. } } fclose \%c
Rewriting the file multiple times is inelegant, but this is a quick and dirty use-once-and-discard script, so elegance doesn't count. The example is interesting in that it purges certain records based on the contents of other records. Maybe there is a way to do this directly with SQL, but why use SQL when you can use Kermit?
Here is the same task but this time no shelling out, and this time we do change and add some fields and then join the result back into a CSV record and write it out to a new file. The object is to create a record for each license that shows not only the date and purchase price of the license but also the date and amount of the last maintenance payment, and to add new fields for sorting by anniversary (month and day):
#!usr/local/bin/kermit + cd ~/somedirectory # CD to appropriate directory if fail exit 1 # Make sure we did .filename := \%1 # Filename from command line if not def filename { # If none give usage message exit 1 "Usage: \%0: infile [ outfile ]" } fopen /read \%c \m(filename) # Open the input CSV file if fail exit # Make sure we did .output := \%2 # Output filename from command line if not def output { # Supply one if not given .output := New_\m(filename) } fopen /write \%o \m(output) # Open output file if fail exit # Check that we did .serial = 00000000000 # Initialize serial number .licenses = 0 # and license counter fread /line \%c line # First line is column labels if fail exit # Check fwrite /line \%o "\m(line),AMM_DD,AYYYY" # Write new labels line # Remaining lines are license purchases (K95B) followed by zero or more # maintenance invoices (K95BM) for each license. .datepaid = 00/00/0000 # Initialize last maint payment date .amtpaid = $0.00 # Initialize last maint payment amount set flag off # For remembering we're at end of file while not flag { # Loop to read all records fread /line \%c line # Read a record if fail set flag on # If EOF set flag for later .n := \fsplit(\m(line),&a,\44,CSV) # Break record into array if ( flag || equ "\&a[3]" "K95B" ) { # License or EOF if fail exit 1 "FAILED: \v(lastcommand)" if licenses { # If this is not the first license .\&x[5] := \m(amtpaid) # Substitute most recent amount paid .\&x[21] := \m(datepaid) # Substitute most recent date paid void \fsplit(\&x[18],&d,/) # Break up original (anniversary) date # and put mm_dd and yyyy in separate fields for sorting... fwrite /line \%o "\fjoin(&x,CSV),\flpad(\&d[1],2,0)_\flpad(\&d[2],2,0),\&d[3]" if fail exit 1 WRITE # Check for error xecho . # Show progress as one dot per record } if flag break # We're at EOF so we're finished increment licenses # New license - count it array copy &a &x # Keep this record while reading next .serial := \&a[6] # Remember serial number .datepaid = 00/00/0000 # Initial maintenance payment date .amtpaid = $0.00 # and amount continue # and go back to read next record } if not eq "\m(serial)" "\&a[6]" { # Catch out-of-sequence record echo echo "SEQUENCE: \m(serial)..\&a[6]: \&a[7] [\&a[1]]" continue } if equ "\&a[5]" "" .\&a[5] = $0.00 # If amount is empty make it $0.00 if not equ "\&a[5]" "$0.00" { # If amount is not $0.00 .datepaid := \&a[21] # remember date paid .amtpaid := \&a[5] # and amount paid } } fclose ALL # Done - close all files and exit exit 0 Done.
The result imports back into Excel, where it can be sorted, formatted, or otherwise manipulated as desired.
Kermit has several built-in data types, but you can invent your own data types as needed using Kermit's macro feature:
For example:define variablename value
This defines a macro named alphabet and gives it the value abcdefghijklmnopqrstuvwxyz. A more convenient notation (added in C-Kermit 7.0) for this is:define alphabet abcdefghijklmnopqrstuvwxyz
The two are exactly equivalent: they make a literal copy the "right hand side" as the value of the macro. Then you can refer to the macro anywhere in a Kermit command as "\m(macroname)":.alphabet = abcdefghijklmnopqrstuvwxyz
There is a second way to define a macro, which is like the first except that the right-hand side is evaluated first; that is, any variable references or function calls in the right-hand side are replaced by their values before the result is assigned to the macro. The command for this is ASSIGN rather than DEFINE:echo "Alphabet = \m(alphabet)"
which prints:define alphabet abcdefghijklmnopqrstuvwxyz assign backwards \freverse(\m(alphabet)) echo "Alphabet backwards = \m(backwards)"
This kind of assignment can also be done like this:Alphabet backwards = zyxwvutsrqponmlkjihgfedcba
Any command starting with a period is an assignment, and the operator (= or :=) tells what to do with the right-hand side before making the assignment..alphabet = abcdefghijklmnopqrstuvwxyz .backwards := \freverse(\m(alphabet))
In both the DEFINE and ASSIGN commands, the variable name itself is taken literally. It is also possible, however, to have Kermit compute the variable name. This is done (as described in Using C-Kermit, 2nd Ed., p.457), using parallel commands that start with underscore: _DEFINE and _ASSIGN (alias _DEF and _ASG). These are just like DEFINE and ASSIGN except they evaluate the variable name before making the assignment. For example:
would create a macro named ONEONEONE with a value of 111, and:define \%a one _define \%a\%a\%a 111
would create the same macro with the same value, but:define \%a one define number 111 _assign \%a\%a\%a \m(number)
would give the macro a value of "\m(number)".define \%a one define number 111 _define \%a\%a\%a \m(number)
You can use the _ASSIGN command to create any kind of data structure you want; you can find some examples in the Object-Oriented Programming section of the Kermit Script Library. In the following program we use this capability to create a two-dimensional array, or matrix, to hold the all the elements of the CSV file, and then to display the matrix:
The matrix is called a and its elements are a[1][1], a[1][2], a[1][3], ... a[2][1], etc, and you can treat this data structure exactly like a two-dimensional array, in which you can refer to any element by its "X and Y coordinates". For example, if the CSV file contained numeric data you could compute row and column sums using simple FOR loops and Kermit's built-in one-dimensional array data type:fopen /read \%c data.csv # Open CSV file if fail exit 1 .\%r = 0 # Row .\%m = 0 # Maximum columns while true { fread /line \%c line # Read a record if fail break # End of file .\%n := \fsplit(\m(line),&a,\44,CSV) # Split record into items incr \%r # Count this row for \%i 1 \%n 1 { # Assign items to this row of matrix _asg a[\%r][\%i] \&a[\%i] } if > \%i \%m { .\%m := \%i } # Remember width of widest row } fclose \%c # Close CSV file decrement \%m # (because of how FOR loop works) echo MATRIX A ROWS: \%r COLUMNS: \%m # Show the matrix for \%i 1 \%r 1 { # Loop through rows for \%j 1 \%m 1 { # Loop through columns of each row xecho "\flpad(\m(a[\%i][\%j]),6)" } echo } exit 0
Note that the sum arrays don't have to be initialized to zero because Kermit's INCREMENT command treats empty definitions as zero in numeric contexts.declare \&r[\%r] # Make an array for the row sums declare \&c[\%m] # Make an array for the column sums for \%i 1 \%r 1 { # Loop through rows for \%j 1 \%m 1 { # Loop through columns of each row increment \&r[\%i] \m(a[\%i][\%j]) # Accumulate row sum increment \&c[\%j] \m(a[\%i][\%j]) # Accumulate column sum } }
Also note that in C-Kermit 8.0 or later we could use a simpler notation for the array dimensions and subscripts (see this section of the C-Kermit 8.0 update notes):
.m = (width of widest row) .r = (number of rows) declare \&r[r] # Make an array for the row sums declare \&c[m] # Make an array for the column sums for i 1 r 1 { # Loop through rows for j 1 m 1 { # Loop through columns of each row increment \&r[i] \m(a[i][j]) # Accumulate row sum increment \&c[j] \m(a[i][j]) # Accumulate column sum } }