Author: Frank da Cruz
Script version: 3.03
Dated: 2022/11/10
This page last updated: Sun Sep 10 07:27:21 2023
Version 3.02
Recognizes FTP URLs as well as HTTP and HTTPS ones.
Changes in html script version 3.02
- New parameter for command line or .htmlrc file:
    noconvertcset=number      (on command line)
    .noconvertcset = number   (in .htmlrc file)
- If the number is not zero, the HTML result uses
the same character encoding as the original text file; it is not converted
to UTF-8. In other words, don't bother with this option unless
you don't want the HTML file to come out in UTF-8. One reason for
using this option might be that after the text file is converted to HTML,
you intend to edit the HTML file, but your editing tools don't support
UTF-8.
- URLs in the source text file are converted into clickable links
in the HTML result.
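To give a rough idea of what turning URLs into links involves, here is a minimal sketch in Python (not the Kermit code the html script actually uses). The pattern, the trailing-punctuation rule, and the function name linkify are all assumptions made for illustration; the only facts taken from this page are that HTTP, HTTPS, and FTP URLs are recognized and that a URL must sit on one line.

```python
import html
import re

# Assumed pattern: one of the schemes this page mentions, then a run of
# characters that are not whitespace, angle brackets, or doublequotes.
URL_RE = re.compile(r'\b(?:https?|ftp)://[^\s<>"]+')

def linkify(line):
    """Escape one line of plain text for HTML and wrap each URL in <a>."""
    out = []
    pos = 0
    for m in URL_RE.finditer(line):
        # Assumption: punctuation at the very end belongs to the sentence.
        url = m.group(0).rstrip('.,;:!?)')
        out.append(html.escape(line[pos:m.start()]))
        out.append('<a href="{0}">{0}</a>'.format(html.escape(url, quote=True)))
        pos = m.start() + len(url)
    out.append(html.escape(line[pos:]))
    return ''.join(out)

print(linkify("See https://example.com/a."))
```

Because the sketch works one line at a time, a URL broken across two lines would only be linked up to the line break.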
Changes in html script version 3.00
- Version 3.0 of the html script generates
HTML5 rather than HTML 4.01, and the resulting page always has UTF-8
character encoding, as HTML5 recommends. If the input file uses ISO 8859-1
Latin Alphabet 1 or Microsoft Code Page 1252 character encoding, the html
script converts it to UTF-8 automatically. If it uses some other
character encoding, you can specify it on the command line (see next item).
- Additional parameters (such as those described below)
can now be included on the command line; for example:
html example.txt "Ejemplos" example.html cset=cp437 lang=es
These parameters are in name=value format with no internal spaces and
no leading dot. The name is the name of any variable used by the
html script, such as the ones listed below.
Character-set names understood by C-Kermit are
listed here.
- A new directive, prefont, lets you specify a different font to use
for indented material that the html script puts
between <pre>..</pre>. For example, if
your document includes a lot of computer source code or tables, you can do this:
html codesnippets.txt "Program code examples" snippets.html prefont=monospace
But I wouldn't recommend this if you also have other types of indented
material, such as bullet lists, enumerated lists, description lists,
or blockquotes.
Introduction
Because I have been writing code and prose for so many years, I have countless
plain-text files lying around that might be useful to more people if they
were on the Web. In 2004 I wrote a Kermit script to convert text files to
HTML but it was too ambitious, trying to know things that were unknowable,
so the result was often comical. But worse, I noticed that sometimes it
lost chunks of text, and other times it failed altogether with some crazy
error. That was version 1.00. I put it aside for 13 years.
In April 2017, when I uploaded C-Kermit 9.0.304 Dev.21,
I wrote:
For details see
the Update
Notes file (scroll to the bottom and work your way up to where it says
"-- 9.0.302 --" in the August 2011 section; sorry, it's an
old-fashioned plain-text file, 9126 lines at last count, converting it to
HTML would be an all-day project).
It turns out that some of the problems with the first HTML converter script
were in Kermit itself, and this time I tracked them down and fixed them, and
then I wrote a new HTML script that is simpler, cleaner, and less
ambitious, but also more powerful in some ways. This was version 2.00
of May 1, 2017.
What the html script is
It's a C-Kermit script; that is, a program written in the C-Kermit command
language. Presently it runs only on UNIX-based operating systems (if you
don't know what that means, click here). You can
look at the script
by clicking
here, and you can read more about Kermit
scripts here. The html script requires
C-Kermit 9.0.304 Dev.22 or later, because
of fixes that were made in that version to correct the problem with missing
chunks of text.
How to install the html script
First, you need to have C-Kermit 9.0.304 Dev.22 or later
installed on your computer. You can get it here.
Then you can download the getkermitscript
script, which downloads Kermit scripts from the Kermit Project website
and installs them for you. Then use getkermitscript to download and install
the html script.
How to invoke the html script
Assuming you have installed the script on your computer in a directory that
is in your Unix PATH, and it has the filename “html”, then you
can invoke it like this:
html inputfilename "pagetitle" outputfilename
That is, the word “html” followed by the name of the text file
you want to convert, and then optionally, a title for the page enclosed
in doublequotes ("), and a name for the output file. For example:
html notes.txt "My Notes" mynotes.html
If you don't specify a title for the page, the script will use the first
line of the file, but only if it is followed by a blank line. If there is
no such line, the title will be “Untitled”.
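The title rule can be written out as a short Python sketch. This is only an illustration of the rule as stated here, not the script's own Kermit code; the function name pick_title is hypothetical, and treating a one-line file as having no title is an assumption.

```python
def pick_title(lines, explicit_title=""):
    """Choose a page title from the command line or the first line of text."""
    if explicit_title:
        return explicit_title
    # The first line counts only if it has text and a blank line follows it.
    # Assumption: a file with a single line has no "following blank line".
    if len(lines) >= 2 and lines[0].strip() and not lines[1].strip():
        return lines[0].strip()
    return "Untitled"

print(pick_title(["My Notes", "", "First paragraph..."]))
print(pick_title(["First paragraph starts here", "and continues here"]))
```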
If you don't specify an output filename, the output file will be given
the name of the input file, but with an .html extension; for example,
notes.txt will produce notes.html, and it will be in the
same directory as the input file, unless you have defined
a destination in your .htmlrc file
(explained below). Let's say you do this, and that
the text file has a first line suitable as a title; then you can just do:
html notes.txt
and the notes.html file will appear in the directory you indicated
in .htmlrc (if any), otherwise in your current directory.
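Here is the default output name worked out as a Python sketch, under the assumption that the script simply swaps the last extension for .html and puts the file in the destination directory when one is set. The function name default_output_path is hypothetical and the real script may handle odd filenames differently.

```python
import os

def default_output_path(input_path, destination=""):
    """Name the output after the input file, with an .html extension."""
    stem = os.path.splitext(os.path.basename(input_path))[0]
    # A destination from .htmlrc wins; otherwise stay beside the input file.
    directory = destination or os.path.dirname(input_path)
    return os.path.join(directory, stem + ".html")

print(default_output_path("notes.txt"))
print(default_output_path("notes.txt", "/home/me/website"))
```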
Using the html script in Unix pipelines
The html script can be used in a Unix pipeline (click here to read
about pipelines). This is something new for
Kermit scripts; it was never possible before, and it depends on
features that were added in C-Kermit 9.0.304 Dev.21 and Dev.22. What it
means is that you can "pipe" the output of any Unix command into the html
script and it will send the result to a file, to your screen, or to the
next program in the pipeline. This is done as follows:
command | html "" "pagetitle" | command
where:
- html is the name of the script;
- "" is a “nothing” in place of the input filename,
so the script knows to read from standard input;
- "pagetitle" is the page title in doublequotes.
If you don't need to supply a title, you can simply do:
command | html | command
A more useful
technique is to redirect html's output to a file:
command | html > outputfilename
Here's a practical example that illustrates how you make a pipeline
of Unix commands, each one doing its particular job:
man kermit | col -b | html > kermitmanpage.html
Here we turn the Unix man (manual) page for Kermit into an
html document (of
course you could do this with any Unix man page). Man pages are generally
full of backspacing and overstriking and other special effects; the Unix
“col -b” command takes out the special effects, and the
result is piped into the html script, whose output is redirected to an html
file.
How to customize the html script
Here are the default parameters the html script uses to create html files:
.destination = # Destination directory
.cset = utf-8 # Character set of source file (see list)
.perms = 644 # Permissions for result
.lang = en # Language tag (English)
.color = black # Text color
.bg = white # Background color
.font = sans-serif # Font-family
.size = 15px # Font-size (must include units)
.margin = 12px # Margins (must include units)
.max-width = # Maximum page width (pixels, no default)
.noconvertcset = 0 # Set to 1 to suppress character-set conversion
If you put them in your ~/.htmlrc file (that is, a file called
.htmlrc in your Unix login directory), you can edit them however
you wish; for example to change the font size, to specify the directory name
for your website, and so on. The items on the right are comments; they are
ignored by the script. The assignments are on the left and have to be as
shown: a period followed (with no spacing) by a variable name, a space, an
'=' sign, another space, and the value for the variable. If no value is
shown above then the item is not used unless you specify a value for it,
as is the case for 'destination' and 'max-width'.
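To make the line format concrete, here is a small Python sketch that reads assignments laid out as described above. It is an illustration only: the html script is a Kermit script and does not use this code, the name parse_htmlrc is hypothetical, and the comment rule (a '#' preceded by whitespace starts a comment) is an assumption.

```python
import re

# A period, the variable name, a space, '=', then optionally a space and a value.
ASSIGN_RE = re.compile(r'^\.([A-Za-z][A-Za-z0-9-]*) =(?: (.*))?$')

def parse_htmlrc(text):
    """Return a dict of the .name = value assignments found in text."""
    settings = {}
    for raw in text.splitlines():
        # Assumption: whitespace followed by '#' begins a trailing comment,
        # so a value that itself contains " #" would be cut short here.
        line = re.sub(r'\s+#.*$', '', raw).rstrip()
        m = ASSIGN_RE.match(line)
        if m:
            settings[m.group(1)] = (m.group(2) or '').strip()
    return settings

sample = ".destination =       # Destination directory\n.size = 15px   # Font-size\n"
print(parse_htmlrc(sample))
```

A name with nothing after the '=' comes back as an empty string, which matches the idea that such an item is not used unless you give it a value.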
In version 3.00 (or later) of the html script, you can also put parameters
on the command line after the third parameter as name=value pairs (no
dot, no spaces around '='). Command-line parameter settings override
.htmlrc ones. Example:
html spaghetti.txt "Spaghetti recipe" spaghetti.html max-width=800px size=14px
Units for size, spacing, margins, etc., must be specified since this is required
by HTML5; px is a safe choice.
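The order of precedence (built-in defaults, then .htmlrc, then the command line) can be sketched in Python as follows. The default values are the ones listed on this page; the function name effective_settings and the way arguments are split are assumptions for illustration, not the script's actual Kermit code.

```python
# Defaults as listed on this page; empty string means "not used unless set".
DEFAULTS = {
    "destination": "", "cset": "utf-8", "perms": "644", "lang": "en",
    "color": "black", "bg": "white", "font": "sans-serif",
    "size": "15px", "margin": "12px", "max-width": "", "noconvertcset": "0",
}

def effective_settings(rc_settings, extra_args):
    """Merge defaults, .htmlrc values, and name=value command-line arguments."""
    settings = dict(DEFAULTS)
    settings.update(rc_settings)          # .htmlrc overrides the defaults
    for arg in extra_args:                # the command line overrides .htmlrc
        name, sep, value = arg.partition("=")
        if sep and name:
            settings[name] = value
    return settings

result = effective_settings({"size": "16px"}, ["max-width=800px", "size=14px"])
print(result["size"], result["max-width"], result["font"])
```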
What the html script does
It reads a plain-text file, which can be in ASCII, ISO 8859-1, Windows Code
Page 1252, UTF-8, or other encoding that you specify, and produces an HTML
version with approximately the same formatting. Here are the rules:
- The text must not contain any HTML markup.
- The text should be in block style, paragraphs not indented.
- Paragraphs are separated by one or more blank lines.
- If URLs are in the file, they must not be broken across lines.
In the resulting HTML file:
- Flush-left text is flowed.
- Indented text (e.g. bullet lists, enumerations, blockquotes) is
preserved as-is.
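The two rules above can be pictured with a short Python sketch. It assumes that each blank-line-separated block is judged by its first line: a block that starts flush left becomes a flowed paragraph, and one that starts with a space or tab is kept as-is between <pre> tags. That per-block test and the name convert_blocks are assumptions for illustration; the real script is written in Kermit and may decide differently.

```python
import html

def convert_blocks(text):
    """Turn blank-line-separated blocks into flowed <p> or as-is <pre>."""
    out = []
    block = []
    for line in text.splitlines() + [""]:   # the extra "" flushes the last block
        if line.strip():
            block.append(line.rstrip())
            continue
        if block:
            if block[0][0] in " \t":        # indented: preserve the layout
                out.append("<pre>\n" + html.escape("\n".join(block)) + "\n</pre>")
            else:                           # flush left: join lines so text flows
                out.append("<p>" + html.escape(" ".join(block)) + "</p>")
            block = []
    return "\n".join(out)

print(convert_blocks("First line\nsecond line\n\n  - item one\n  - item two\n"))
```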
The html script does not attempt to deal with:
- Multilevel lists;
- Hanging-indent description lists;
- Headings within the text;
- Text where paragraphs are indicated by indentation of the first line,
rather than being separated by blank lines.
The script has no way of knowing when it should switch between a
proportional font and a fixed-width font to preserve the layout of tables or
source code. The assumption is that the author of the plain-text file
formatted it in the desired way; the script preserves the original
formatting except when a proportional font is used in the html result page.
You can override this in your .htmlrc file by specifying a font
such as "monospace" or "courier", but this puts the entire page in the given
font. Aside from that, there is no way to change fonts within a page.
Finally, the html script puts Top and Bottom anchors and links in the page.
To illustrate, here is a long plain-text file (14 years' worth of C-Kermit
update notes):
NOTES.TXT
and here is the result of running it through the html script:
ckupdates.html
Improving the results
You can edit any html file produced by this script; for example to put
selected parts of text in bold, italics, or monospace, or to add headings,
etc. But if you run the html script on the same text file again, your
changes will be lost. Therefore the main uses for the html script would be:
- To produce Web pages from text files that will not change;
- As a first step in migrating a text file to html, in which case the
text file will no longer be used and all updates will be made to the html file.
Debugging
To see what the script is doing, put “DEBUG=1” at the beginning
of the command line. Example:
DEBUG=1 html notes.txt