Exploring pre-1990 versions of wc(1)
Can you blame a tool that doesn't support a standard X when the tool
was written before X was invented?
While a reasonable person would certainly answer, "You can't" (and
add, "You shouldn't"), sometimes it could be educational to examine
how the absence of X has shaped the assumptions of the tool authors.
Take one of the simplest Unix utils, wc(1). This is a man page from v7
Research UNIX (1979):
Notice the clarity and brevity (almost non-existent these days), along
with the following definition:
"A word is a maximal string of characters delimited by spaces, tabs
or newlines."
One may presume that this is equivalent to what in JavaScript would
be a basic:
> "та за шо\n".split(/[\t\n ]/).filter(Boolean).length
3
(I deliberately didn't use \s and +.)
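For comparison with the C sources examined later, here is a minimal sketch of the same byte-oriented reading of the man page definition in C. The function name count_words_bytes is mine, not from any of the historical sources, and the code treats the input purely as bytes, with no notion of encoding.

```c
#include <stddef.h>

/* Count "maximal strings of characters delimited by spaces, tabs or
 * newlines", reading the definition literally and per byte.
 * Hypothetical helper, not taken from any historical wc. */
long count_words_bytes(const unsigned char *buf, size_t n)
{
    long words = 0;
    int in_word = 0;   /* 1 while we are inside a run of non-separators */

    for (size_t i = 0; i < n; i++) {
        int sep = buf[i] == ' ' || buf[i] == '\t' || buf[i] == '\n';
        if (sep) {
            in_word = 0;
        } else if (!in_word) {
            in_word = 1;   /* first byte of a new word */
            words++;
        }
    }
    return words;
}
```

Because every byte other than the three separators counts as part of a word, the multi-byte UTF-8 sequences in the test phrase are simply absorbed into the words they belong to.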
The answer from the JavaScript snippet is exactly what the modern
wc(1) from coreutils prints:
$ echo та за шо | wc -w
3
wc from v7 Research UNIX has a different opinion, though:
$ locale | grep LC_CTYPE
LC_CTYPE=uk_UA.UTF-8
$ cc -Wall usr/src/cmd/wc.c -o wc 2>&1 | grep warning | wc -l
7
$ echo та за шо | ./wc -w
0
Obviously, they didn't have UTF-8 in 1979 (it was invented in 1992),
but if a word is just a sequence of bytes that are not \t, \n, or
space, shouldn't even such a prehistoric version of wc have counted
the input correctly?
The earliest version of coreutils' wc that I was able to find is
from 1989. I fetched it from a GitHub mirror, and what do you know:
$ curl -s https://raw.githubusercontent.com/coreutils/coreutils/b25038ce9a234ea0906ddcbd8a0012e917e6c661/src/wc.c > wc.coreutls.1989.c
$ cat patch
24c24,28
< #include "system.h"
---
> #include <errno.h>
> #include <fcntl.h>
> #include <unistd.h>
> #include <sys/stat.h>
> #include <string.h>
$ patch wc.coreutls.1989.c -o wc.c < patch > /dev/null
$ cc -Wall wc.c -o wc 2>&1 | grep warning | wc -l
4
$ echo та за шо | ./wc -w
3
It gives the correct answer, despite not knowing about UTF-8 in 1989
either.
I became curious as to why AT&T's version failed.
v7, 1979
The source is pleasantly small. The following loop, the gist of the
program, explains the deal:
linect = 0;
wordct = 0;
charct = 0;
token = 0;
for(;;) {
    c = getc(fp);
    if (c == EOF)
        break;
    charct++;
    if(' '<c&&c<0177) {
        if(!token) {
            wordct++;
            token++;
        }
        continue;
    }
    if(c=='\n')
        linect++;
    else if(c!=' '&&c!='\t')
        continue;
    token = 0;
}
/* print lines, words, chars */
wcp(wd, charct, wordct, linect);
The wordct variable is incremented only if the current char falls
strictly between space (040 in octal) and 0177 (DEL, the last entry
in ASCII). Every byte of our input other than the separators lies
above that range:
$ echo та за шо | hexdump -b
0000000 321 202 320 260 040 320 267 320 260 040 321 210 320 276 012
000000f
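To make the effect of that range check concrete, here is a small sketch that applies the same test to an in-memory buffer instead of a FILE. The function name count_words_v7 and the buffer-based interface are my own; only the comparisons are taken from the loop above.

```c
#include <stddef.h>

/* Word counting with the v7 range check, applied to a buffer.
 * Hypothetical wrapper: the original reads with getc() and also
 * counts lines and chars. */
long count_words_v7(const unsigned char *buf, size_t n)
{
    long words = 0;
    int token = 0;   /* nonzero while inside a word */

    for (size_t i = 0; i < n; i++) {
        int c = buf[i];
        if (' ' < c && c < 0177) {   /* printable ASCII only */
            if (!token) {
                words++;
                token = 1;
            }
            continue;
        }
        if (c != '\n' && c != ' ' && c != '\t')
            continue;   /* control or high byte: skipped, token unchanged */
        token = 0;      /* a separator ends the current word */
    }
    return words;
}
```

Fed the 15 bytes from the hexdump, this never enters the first branch, so the count stays at 0. A high byte in the middle of an ASCII word does not split it either, since such bytes leave token untouched.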
v8, 1985
The total rewrite of wc.c they did for v8 didn't resolve our issue:
$ cc -Wall usr/src/cmd/wc.c -o wc 2>&1 | grep warning | wc -l
17
$ echo та за шо | ./wc -w
0
From an aesthetic standpoint, the beginning of v8's wc.c looks
absolutely fabulous. It's such a beaut that I'll leave it here as a
screenshot:
This time, instead of putting the main loop in main(), they
designated a separate function for the counting:
count(fd, name)
char *name;
{
    register token=0, n;
    register unsigned char *cp;
    register long chars=0, lines=0, words=0;
    while((n=read(fd, buf, sizeof buf))>0){
        chars+=n;
        cp=buf;
        while(--n>=0)
            switch(type[*cp++]|token){
            case NL:
                lines++;
                break;
            case NL|TOKEN:
                lines++;
                token=0;
                break;
            case SP:
                break;
            case SP|TOKEN:
                token=0;
                break;
            case ORD:
                token=TOKEN;
                words++;
                break;
            case ORD|TOKEN:
                break;
            case JUNK:
            case JUNK|TOKEN:
                break;
            }
    }
    close(fd);
    print(chars, words, lines, name);
    tchars+=chars;
    twords+=words;
    tlines+=lines;
}
Although the token variable is used in a clever way to keep track of
whether the previous character was part of a word, for our input it
never changes its value from 0, because v8 Research UNIX considers
everything > 0177 to be JUNK, not worthy of noticing.
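The excerpt does not show the type[] table or the values of NL, SP, ORD, JUNK and TOKEN, so the following is only a sketch of how such a table-driven state machine could be wired up. The enum values, init_type() and count_words_v8() are my assumptions, chosen so that OR-ing a class with TOKEN yields a distinct case label for each combination.

```c
#include <stddef.h>

/* Assumed values: the real v8 constants are not shown in the excerpt. */
enum { JUNK = 0, ORD = 1, SP = 2, NL = 3, TOKEN = 4 };

static unsigned char type[256];   /* byte class lookup, filled by init_type() */

/* Assumed classification: printable ASCII is ORD, space and tab are SP,
 * newline is NL, everything else (including bytes > 0177) is JUNK. */
static void init_type(void)
{
    for (int c = 0; c < 256; c++)
        type[c] = (c > ' ' && c < 0177) ? ORD : JUNK;
    type[' '] = type['\t'] = SP;
    type['\n'] = NL;
}

long count_words_v8(const unsigned char *buf, size_t n)
{
    long words = 0;
    int token = 0;   /* 0 outside a word, TOKEN inside one */

    init_type();
    for (size_t i = 0; i < n; i++) {
        switch (type[buf[i]] | token) {
        case NL | TOKEN:
        case SP | TOKEN:
            token = 0;       /* a separator ends the word */
            break;
        case ORD:
            token = TOKEN;   /* first ORD byte starts a word */
            words++;
            break;
        default:             /* NL, SP, ORD|TOKEN, JUNK, JUNK|TOKEN */
            break;           /* state unchanged */
        }
    }
    return words;
}
```

Under this assumed table, a phrase made only of high bytes and separators never reaches the ORD case, which matches the 0 that the real v8 binary printed.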
v9, 1986
Unfortunately, the vast differences in wc.c between v8 and v9 didn't
alter the word counting:
$ diff -u v8/usr/src/cmd/wc.c v9/cmd/wc.c
--- v8/usr/src/cmd/wc.c 1985-07-05 08:48:38.000000000 +0400
+++ v9/cmd/wc.c 1988-01-15 19:51:45.000000000 +0300
@@ -65,7 +65,7 @@
 {
     register i, fd, status=0;
     if(argc>1 && argv[1][0]=='-'){
-        opt=++argv[1];
+        opt= ++argv[1];
         --argc, argv++;
     }
     if(argc==1)
v10, 1989
A big year: Poland disentangled itself from the USSR, and the Berlin
Wall fell.
The humble wc.c got a new rewrite.
$ cc -Wall cmd/wc.c -o wc 2>&1 | grep warning | wc -l
16
$ echo та за шо | ./wc -w
3
I can't believe my eyes! It works. What happened? Did the sudden
competition from GNU force them to fix some v9 bugs? Or did somebody
at AT&T receive an email in CP437 from Italy? Who knows.
Look at this exemplar, casually jumping between the loops, a model we
should strive for:
count(fd, name)
char *name;
{
    register n;
    register unsigned char *cp, *cpend;
    register long chars=0, lines=0, words=0;
    for(;;){
        if((n=read(fd, buf, NBUF))<=0)
            goto done;
        chars+=n;
        cp=buf;
        *(cpend = buf+n) = ' ';
        cpend[1] = 'a';
    dospace:
        for(;;){
            if(*cp == '\n')
                lines++;
            if(space[*cp++] == 0){
                if(cp > cpend)
                    break;
                goto doword;
            }
        }
    }
    for(;;){
        if((n=read(fd, buf, NBUF))<=0)
            goto done;
        chars+=n;
        cp=buf;
        *(cpend = buf+n) = ' ';
        cpend[1] = 'a';
    doword:
        for(;;){
            if(space[*cp++]){
                if(cp > cpend)
                    break;
                words++;
                if(cp[-1] == '\n')
                    lines++;
                goto dospace;
            }
        }
    }
done:
    close(fd);
    printout(chars, words, lines, name);
    tchars+=chars;
    twords+=words;
    tlines+=lines;
}
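The two bytes written past the end of the data (' ' and then 'a') are sentinels: one guarantees that the word-scanning loop stops, the other that the space-skipping loop stops, so neither inner loop needs a bounds check. Below is a simplified single-buffer sketch of that trick. The is_space table is my guess at what v10's space[] holds (its contents are not shown above), count_words_sentinel is a made-up name, and unlike the original I count a word when it starts rather than when it ends and ignore lines and multiple reads.

```c
#include <stddef.h>

/* Assumed contents: nonzero for space, tab and newline only, so bytes
 * above 0177 are ordinary word bytes. */
static const unsigned char is_space[256] = {
    [' '] = 1, ['\t'] = 1, ['\n'] = 1
};

/* buf must have room for n + 2 bytes: the two extra bytes hold the
 * sentinels, as in the v10 code. Hypothetical simplified variant. */
long count_words_sentinel(unsigned char *buf, size_t n)
{
    unsigned char *cp = buf, *cpend = buf + n;
    long words = 0;

    cpend[0] = ' ';   /* stops the word loop at the end of the data */
    cpend[1] = 'a';   /* stops the space loop right after it */
    for (;;) {
        while (is_space[*cp++])    /* skip separators */
            ;
        if (cp > cpend)            /* stopped on the 'a' sentinel */
            break;
        words++;                   /* stopped on the first byte of a word */
        while (!is_space[*cp++])   /* skip the rest of the word */
            ;
        if (cp > cpend)            /* stopped on the ' ' sentinel */
            break;
    }
    return words;
}
```

After each inner loop cp points one past the byte that stopped it, so cp > cpend is true exactly when that byte was a sentinel rather than real data.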
I should end this with a note that I write sarcastic comments in
jest. I adore everything the folks at Bell Labs did. I obtained the
code examples for v7-v10 from the tuhs.org archive.
Tags: ойті
Authors: ag