Alexander Gromnitsky's Blog :: 2023-07-10 :: Exploring pre-1990 versions of wc(1)

Exploring pre-1990 versions of wc(1)

Latest update: 2023-07-24 23:21:00

Can you blame a tool that doesn't support a standard X when the tool was written before X was invented?

While a reasonable person would certainly answer, "You can't" (and add, "You shouldn't"), it can sometimes be educational to examine how the absence of X shaped the assumptions of the tool's authors.

Take one of the simplest Unix utils, wc(1). This is its man page from v7 Research UNIX (1979):

[Image: the wc(1) man page from v7 Research UNIX]

Notice the clarity and the brevity (almost non-existent these days), along with the following definition:

"A word is a maximal string of characters delimited by spaces, tabs or newlines."

One may presume that this is equivalent to the following basic JavaScript:

> "та за шо\n".split(/[\t\n ]/).filter(Boolean).length
3

(I deliberately didn't use \s & +.)

The answer from the JavaScript snippet is exactly what the modern wc(1) from coreutils prints:

$ echo та за шо | wc -w
3

wc from v7 Research UNIX has a different opinion, though:

$ locale | grep LC_CTYPE
LC_CTYPE=uk_UA.UTF-8
$ cc -Wall usr/src/cmd/wc.c -o wc 2>&1 | grep warning | wc -l
7
$ echo та за шо | ./wc -w
      0

Obviously, they didn't have UTF-8 in 1979 (it was invented in 1992), but if a word is just a sequence of bytes that are not \t, \n, or space, shouldn't even such a prehistoric version of wc have analyzed the input correctly?
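
To make that presumption concrete, here is a minimal C sketch of the byte-oriented reading of the man page definition. The function name count_words_bytes is mine and does not come from any historical source, and for simplicity it works on a NUL-terminated string rather than on a file:

```c
/* Count words the way the v7 man page describes them: a word is a
 * maximal run of bytes that are not space, tab or newline.
 * No other byte value gets special treatment.
 * Hypothetical helper for illustration, not historical code. */
long count_words_bytes(const char *s)
{
    long words = 0;
    int in_word = 0;    /* 1 while we are inside a word */

    for (; *s != '\0'; s++) {
        if (*s == ' ' || *s == '\t' || *s == '\n') {
            in_word = 0;        /* a delimiter ends the current word */
        } else if (!in_word) {
            in_word = 1;        /* first byte of a new word */
            words++;
        }
    }
    return words;
}
```

Fed the UTF-8 bytes of "та за шо\n", this returns 3, because none of those bytes happens to be a space, a tab or a newline except the real delimiters.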

The earliest version of coreutils' wc that I was able to find is from 1989. I fetched it from a GitHub mirror, and what do you know:

$ curl -s https://raw.githubusercontent.com/coreutils/coreutils/b25038ce9a234ea0906ddcbd8a0012e917e6c661/src/wc.c > wc.coreutls.1989.c
$ cat patch
24c24,28
< #include "system.h"
---
> #include <errno.h>
> #include <fcntl.h>
> #include <unistd.h>
> #include <sys/stat.h>
> #include <string.h>
$ patch wc.coreutls.1989.c -o wc.c < patch > /dev/null
$ cc -Wall wc.c -o wc 2>&1 | grep warning | wc -l
4
$ echo та за шо | ./wc -w
      3

It gives the correct answer, despite not knowing about UTF-8 in 1989 either.

I became curious as to why AT&T's version failed.

v7, 1979

The source is pleasantly small. Its main loop, in essence, explains what is going on:

linect = 0;
wordct = 0;
charct = 0;
token = 0;
for(;;) {
    c = getc(fp);
    if (c == EOF)
        break;
    charct++;
    if(' '<c&&c<0177) {
        if(!token) {
            wordct++;
            token++;
        }
        continue;
    }
    if(c=='\n')
        linect++;
    else if(c!=' '&&c!='\t')
        continue;
    token = 0;
}
/* print lines, words, chars */
wcp(wd, charct, wordct, linect);

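The wordct variable is incremented only when the current character lies strictly between space (040 in octal) and DEL (0177, the last entry in ASCII), that is, when it is a printable, non-space ASCII character. Any other byte that is not a space, a tab or a newline is skipped without touching the word state. The majority of our input was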

$ echo та за шо | hexdump -b
0000000 321 202 320 260 040 320 267 320 260 040 321 210 320 276 012
000000f

slightly above the range.
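
The effect is easy to reproduce outside of v7. Below is a small sketch of the same classification, applied to a NUL-terminated string instead of a FILE; the function name v7_words and the string interface are my own choices for illustration, the branching follows the loop quoted above:

```c
/* Word counting with the same tests as the v7 loop, but over a
 * NUL-terminated string (an assumption made to keep the sketch
 * easy to call; the original reads a FILE with getc). */
long v7_words(const char *s)
{
    long wordct = 0;
    int token = 0;      /* non-zero while inside a word */

    for (; *s != '\0'; s++) {
        int c = (unsigned char)*s;      /* 0..255, like a getc() result */

        if (' ' < c && c < 0177) {      /* printable, non-space ASCII */
            if (!token) {
                wordct++;
                token++;
            }
            continue;
        }
        if (c != '\n' && c != ' ' && c != '\t')
            continue;   /* any other byte: ignored, word state kept */
        token = 0;      /* a separator ends the current word */
    }
    return wordct;
}
```

With plain ASCII such as "ta za sho\n" the function returns 3. With the UTF-8 bytes from the hexdump it returns 0: every byte of every word is above 0177, so no byte ever starts a word.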

v8, 1985

The total rewrite of wc.c they did for v8 didn't resolve our issue:

$ cc -Wall usr/src/cmd/wc.c -o wc 2>&1 | grep warning | wc -l
17
$ echo та за шо | ./wc -w
      0

From an aesthetic standpoint, the beginning of v8's wc.c looks absolutely fabulous. It's such a beaut that I'll leave it here as a screenshot:

[Screenshot: the beginning of v8's wc.c]

This time, instead of putting the main loop in main(), they designated a separate function for the counting:

count(fd, name)
    char *name;
{
    register token=0, n;
    register unsigned char *cp;
    register long chars=0, lines=0, words=0;
    while((n=read(fd, buf, sizeof buf))>0){
        chars+=n;
        cp=buf;
        while(--n>=0)
            switch(type[*cp++]|token){
            case NL:
                lines++;
                break;
            case NL|TOKEN:
                lines++;
                token=0;
                break;
            case SP:
                break;
            case SP|TOKEN:
                token=0;
                break;
            case ORD:
                token=TOKEN;
                words++;
                break;
            case ORD|TOKEN:
                break;
            case JUNK:
            case JUNK|TOKEN:
                break;
            }
    }
    close(fd);
    print(chars, words, lines, name);
    tchars+=chars;
    twords+=words;
    tlines+=lines;
}

Although the token variable is used in a clever way to keep track of whether the previous character was part of a word, for our input it never changes its value from 0, because v8 Research UNIX considers everything > 0177 to be JUNK, not worthy of notice.
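
The count() excerpt doesn't include the lookup table, so here is a sketch of how such a table-driven classifier could be put together. The numeric values of TOKEN, NL, SP, ORD and JUNK, the init_type() helper and the v8_words() wrapper are all my assumptions; I only picked values that keep every "class | token" combination in the switch distinct. Treating control bytes as JUNK is an assumption as well:

```c
/* Hypothetical flag values: chosen only so that each (class | token)
 * case label is unique. The real v8 values may differ. */
enum { TOKEN = 1, NL = 2, SP = 4, ORD = 6, JUNK = 8 };

static unsigned char type[256];

/* Fill the table: printable non-space ASCII is ORD, space and tab
 * are SP, newline is NL, everything else is JUNK (including every
 * byte above 0177). */
void init_type(void)
{
    for (int i = 0; i < 256; i++)
        type[i] = (' ' < i && i < 0177) ? ORD : JUNK;
    type[' '] = SP;
    type['\t'] = SP;
    type['\n'] = NL;
}

/* Word counting only, over a NUL-terminated string. */
long v8_words(const char *s)
{
    int token = 0;
    long words = 0;

    init_type();
    for (; *s != '\0'; s++)
        switch (type[(unsigned char)*s] | token) {
        case NL | TOKEN:
        case SP | TOKEN:
            token = 0;          /* whitespace ends the word */
            break;
        case ORD:
            token = TOKEN;      /* first byte of a new word */
            words++;
            break;
        default:                /* NL, SP, ORD|TOKEN, JUNK, JUNK|TOKEN */
            break;              /* nothing changes */
        }
    return words;
}
```

Under these assumptions the sketch gives 3 for "ta za sho\n" and 0 for the UTF-8 input: a JUNK byte neither starts a word nor ends one.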

v9, 1986

Unfortunately, the vast differences in wc.c between v8 and v9 didn't alter the word counting:

$ diff -u v8/usr/src/cmd/wc.c v9/cmd/wc.c
--- v8/usr/src/cmd/wc.c 1985-07-05 08:48:38.000000000 +0400
+++ v9/cmd/wc.c 1988-01-15 19:51:45.000000000 +0300
@@ -65,7 +65,7 @@
 {
        register i, fd, status=0;
        if(argc>1 && argv[1][0]=='-'){
-               opt=++argv[1];
+               opt= ++argv[1];
                --argc, argv++;
        }
        if(argc==1)

v10, 1989

A big year: Poland disentangled itself from the USSR, and the Berlin Wall fell.

The humble wc.c got a new rewrite.

$ cc -Wall cmd/wc.c -o wc 2>&1 | grep warning | wc -l
16
$ echo та за шо | ./wc -w
      3

I can't believe my eyes! It works. What happened? Did the sudden competition from GNU force them to fix some v9 bugs? Or did somebody at AT&T receive an email in CP437 from Italy? Who knows.

Look at this exemplar, casually jumping between the loops, a model we should strive for:

count(fd, name)
    char *name;
{
    register n;
    register unsigned char *cp, *cpend;
    register long chars=0, lines=0, words=0;

    for(;;){
        if((n=read(fd, buf, NBUF))<=0)
            goto done;
        chars+=n;
        cp=buf;
        *(cpend = buf+n) = ' ';
        cpend[1] = 'a';
dospace:
        for(;;){
            if(*cp == '\n')
                lines++;
            if(space[*cp++] == 0){
                if(cp > cpend)
                    break;
                goto doword;
            }
        }
    }
    for(;;){
        if((n=read(fd, buf, NBUF))<=0)
            goto done;
        chars+=n;
        cp=buf;
        *(cpend = buf+n) = ' ';
        cpend[1] = 'a';
doword:
        for(;;){
            if(space[*cp++]){
                if(cp > cpend)
                    break;
                words++;
                if(cp[-1] == '\n')
                    lines++;
                goto dospace;
            }
        }
    }
done:
    close(fd);
    printout(chars, words, lines, name);
    tchars+=chars;
    twords+=words;
    tlines+=lines;
}
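
For those who prefer to stay out of the gotos, here is my reading of the word-counting part as a goto-free sketch. The contents of the space[] table are not part of the excerpt, so marking only space, tab and newline is an assumption (it is at least consistent with the answer of 3 above). The function name v10_words and the string interface are mine, and the sketch leaves out the sentinel bytes the original plants after the buffer:

```c
/* Assumed table: non-zero for whitespace bytes, zero for all others,
 * so bytes above 0177 count as ordinary word bytes. */
static unsigned char space[256];

/* Two-state word counter over a NUL-terminated string. As in the
 * doword loop quoted above, a word is counted at the moment the
 * whitespace byte that ends it is seen. */
long v10_words(const char *s)
{
    long words = 0;
    int in_word = 0;    /* 0: the "dospace" state, 1: the "doword" state */

    space[' '] = space['\t'] = space['\n'] = 1;
    for (; *s != '\0'; s++) {
        if (space[(unsigned char)*s]) {
            if (in_word)
                words++;        /* the word just ended */
            in_word = 0;
        } else {
            in_word = 1;
        }
    }
    /* A final word with no whitespace after it is not counted here,
     * which is how I read the quoted code at end of input. */
    return words;
}
```

Since no byte is rejected as junk any more, the UTF-8 input yields 3, same as the ASCII one.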

I should end this with a note that I write sarcastic comments in jest. I adore everything the folks at Bell Labs did. I obtained the code examples for v7-v10 from the tuhs.org archive.

Tags: ойті
Authors: ag