Alexander Gromnitsky's Blog

An HTTP client in Bash

I recently saw a tweet where a guy was asking how to download curl inside a minimal Debian container that had no scripting languages installed except Bash, and no wget or anything of that sort.

If such a container has apt-get, but you lack the permissions to run it as root, there is a reliable way to force apt-get to download a .deb file with all its dependencies as a regular user, but we won't discuss that here.

I got curious about how hard it would be to write a primitive HTTP get-only client in Bash, as Bash is typically compiled with "network" redirection support:

$ exec 3<> /dev/tcp/www.gnu.org/80
$ printf "%s\r\n" 'HEAD /robots.txt HTTP/1.1' >&3
$ printf "%s\r\n\r\n" 'Host: www.gnu.org' >&3
$ cat <&3
HTTP/1.1 200 OK
Date: Sun, 11 Feb 2024 07:02:40 GMT
Server: Apache/2.4.29
Content-Type: text/plain
Content-Language: non-html
…
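
Once done, the descriptor can be closed with another exec:

$ exec 3>&-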

This could've been useful before the days of TLS everywhere, but it won't suffice now: to download a statically compiled curl binary from GitHub, we need TLS support and proper handling of 302 redirects. Certainly, it's possible to cheat: put the binary on our own web server and serve it over plain HTTP, but that would be too easy.

What if we use ncat+openssl as a forward TLS proxy? ncat may serve as an inetd-like super-server, invoking "openssl s_client" on each connection:

$ cat proxy.sh
#!/bin/sh
read -r host
openssl s_client -quiet -no_ign_eof -verify_return_error "$host"
$ ncat -vk -l 10.10.10.10 1234 -e proxy.sh
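
Assuming the proxy is listening on 10.10.10.10:1234 as above, we can smoke-test it by hand: the 1st line we send is the host:port for proxy.sh to read, everything after that goes into the TLS connection verbatim:

$ exec 3<> /dev/tcp/10.10.10.10/1234
$ echo 'www.gnu.org:443' >&3
$ printf '%s\r\n' 'HEAD /robots.txt HTTP/1.1' 'Host: www.gnu.org' 'Connection: close' '' >&3
$ cat <&3
HTTP/1.1 200 OK
…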

The 1st thing we need in the bash-http-get client is URL parsing. It wouldn't have been necessary if GitHub served files directly from "Releases" pages, but it does so through redirects. Therefore, when we grab the Location header from a response, we need to disentangle the hostname from the pathname.

Ideally, it should work like the URL() constructor in JavaScript:

$ node -pe 'new URL("https://q.example.com:8080/foo?q=1&w=2#lol")'
URL {
  href: 'https://q.example.com:8080/foo?q=1&w=2#lol',
  origin: 'https://q.example.com:8080',
  protocol: 'https:',
  username: '',
  password: '',
  host: 'q.example.com:8080',
  hostname: 'q.example.com',
  port: '8080',
  pathname: '/foo',
  search: '?q=1&w=2',
  searchParams: URLSearchParams { 'q' => '1', 'w' => '2' },
  hash: '#lol'
}

StackOverflow has various examples of how to achieve that using regular expressions, but none of them were able to parse the example above. I tried asking ChatGPT to repair the regex, but it only made it worse. Miraculously, Google's Gemini supposedly fixed the regex on the second try (I haven't tested it extensively).

$ cat lib.bash
declare -A URL

url_parse() {
    local pattern='^(([^:/?#]+):)?(//((([^:/?#]+)@)?([^:/?#]+)(:([0-9]+))?))?(/([^?#]*))?(\?([^#]*))?(#(.*))?'
    [[ "$1" =~ $pattern ]] && [ "${BASH_REMATCH[2]}" ] && [ "${BASH_REMATCH[4]}" ] || return 1
    URL=(
        [proto]=${BASH_REMATCH[2]}
        [host]=${BASH_REMATCH[4]}
        [hostname]=${BASH_REMATCH[7]}
        [port]=${BASH_REMATCH[9]}
        [pathname]=${BASH_REMATCH[10]:-/}
        [search]=${BASH_REMATCH[12]}
        [hash]=${BASH_REMATCH[14]}
    )
}
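
A quick sanity check of the parser (in an interactive bash, with lib.bash in the current directory):

$ . ./lib.bash
$ url_parse 'https://q.example.com:8080/foo?q=1&w=2#lol' && echo "${URL[hostname]} ${URL[port]} ${URL[pathname]}${URL[search]}${URL[hash]}"
q.example.com 8080 /foo?q=1&w=2#lol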

Next, we need to separate headers from a response body. This means looking for the 1st occurrence of \r\n\r\n. Sounds easy,

grep -aobx $'\r' file | head -1

until you decide to port the client to a BusyBox-based system like Alpine Linux. The latter ships a grep that doesn't support the -a and -b options. There is some advice online about employing od(1) instead, but no examples. If we print a file in a 2-column format:

0000000 68
0000001 20
0000002 3a
…

where the left column is a decimal offset, we can convert the 1st 32KB of the response into a single line and search for the pattern using grep -o:

od -N $((32*1024)) -t x1 -Ad -w1 -v "$tmp" | tr '\n' ' ' | \
    grep -o '....... 0d ....... 0a ....... 0d ....... 0a' | \
    awk '{if (NR==1) print $7+0}'
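
To check the pipeline, feed it a hand-made response; the number it prints is the offset of the \n that ends the blank line, i.e. the last byte of the header block:

$ printf 'HTTP/1.1 200 OK\r\nFoo: bar\r\n\r\nbody' > resp
$ od -N $((32*1024)) -t x1 -Ad -w1 -v resp | tr '\n' ' ' | \
    grep -o '....... 0d ....... 0a ....... 0d ....... 0a' | \
    awk '{if (NR==1) print $7+0}'
28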

Here's the full version of the client that supports only URLs with the https protocol. It saves the response in a temporary file and looks for the \r\n\r\n offset. If the HTTP status code was 200, it prints the body to stdout. If it was 302, it extracts the value of the Location header and recursively calls itself with a new URL.

#!/usr/bin/env bash

set -e -o pipefail
. "$(dirname "$(readlink -f "$0")")/lib.bash"

tmp=`mktemp fetch.XXXXXX`
trap 'rm -f $tmp' 0 1 2 15
eh() { echo "$*" 1>&2; exit 2; }

[ $# = 3 ] || eh Usage: fetch.bash proxy_host proxy_port url
proxy_host=$1
proxy_port=$2
url=$3

get() {
    url_parse "$1"; [ "${URL[proto]}" = https ] || return 1

    exec 3<> "/dev/tcp/$proxy_host/$proxy_port" || return 1
    echo "${URL[hostname]}:${URL[port]:-443}" >&3
    printf "GET %s HTTP/1.1\r\n" "${URL[pathname]}${URL[search]}${URL[hash]}" >&3
    printf '%s: %s\r\n' Host "${URL[hostname]}" Connection close >&3
    printf '\r\n' >&3
    cat <&3
}

get "$url" > "$tmp" || eh ':('
[ -s "$tmp" ] || eh 'Empty reply, TLS error?'

offset_calc() {
    if echo 1 | grep -aobx 1 >/dev/null 2>&1; then # gnu-like grep
        grep -aobx $'\r' "$tmp" | head -1 | tr -d '\r\n:' | \
            xargs -r expr 1 +
    else                                      # busybox?
        od -N $((32*1024)) -t x1 -Ad -w1 -v "$tmp" | tr '\n' ' ' | \
            grep -o '....... 0d ....... 0a ....... 0d ....... 0a' | \
            awk '{if (NR==1) print $7+0}'
    fi || echo -1
}
offset=`offset_calc`
headers() { head -c "$offset" "$tmp" | tr -d '\r'; }
hdr() { headers | grep -m1 -i "^$1:" | cut -d' ' -f2; }

status=`head -1 "$tmp" | cut -d' ' -f2`
case "$status" in
    200) [ "$offset" = -1 ] && offset=-2 # invalid response, dump it all
         tail -c+$((offset + 2)) "$tmp"
         [ "$offset" -gt 0 ] ;;
    302) headers 1>&2; echo 1>&2
         hdr location | xargs "$0" "$1" "$2" ;;
    *)   headers 1>&2; exit 1
esac

It should work even on Alpine Linux or FreeBSD:

$ ./fetch.bash 10.10.10.10 1234 https://github.com/stunnel/static-curl/releases/download/8.6.0/curl-linux-arm64-8.6.0.tar.xz > curl.tar.xz
HTTP/1.1 302 Found
Location: https://objects.githubusercontent.com/…
…
$ file curl.tar.xz
curl.tar.xz: XZ compressed data, checksum CRC64
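
The last step is to check that the prize actually runs (assuming the tarball unpacks the binary into the current directory & the architecture matches):

$ tar xf curl.tar.xz
$ ./curl --version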

Comparing Compression

Do you benchmark compression tools (like xz or zstd) on your own data, or do you rely on common wisdom? The best result for an uncompressed 300MB XFS image from the previous post was achieved by bzip2, which is rarely used nowadays. How does one quickly check a chunk of data against N popular compressors?

E.g., an unpacked tarball of Emacs 29.2 source code consists of 6791 files with a total size of 276MB. If you were to distribute it as a .tar.something archive, which compression tool would be the optimal choice? We can easily write a small utility that answers this question.

$ ./comprtest ~/opt/src/emacs/emacs-29.2 | tee table
tar: Removing leading `/' from member names
szip             0.59   56.98        126593557
gzip             9.21   72.70         80335332
compress         3.57   57.45        125217137
bzip2           17.28   78.08         64509672
rzip            17.61   79.50         60336377
lzip           113.61   81.67         53935898
lzop             0.67   57.14        126121462
xz             111.03   81.89         53295220
brotli          13.10   78.14         64336399
zstd             1.13   73.77         77179446

comprtest is a 29-LOC shell script. The 2nd column here indicates time in seconds, the 3rd displays 100 × (1 − compressed/original), i.e. the space saving in % (the higher, the better), & the 4th shows the resulting size in bytes.

Then we can sort the table by the 3rd column & draw a bar chart:

$ sort -nk3 table | cpp -P plot.gp | gnuplot -persist

If you're wondering why all of a sudden the C preprocessor becomes part of it, read on.

comprtest expects either a file as an argument or a directory (in which case it creates a plain .tar of it first). Additional optional arguments specify which compressors to use:

$ ./comprtest /usr/libexec/gdb gzip brotli
gzip             0.60   61.17          6054706
brotli           1.17   65.84          5325408

The gist of the script involves looping over a list of compressors:

archivers='szip gzip compress bzip2 rzip lzip lzop xz brotli zstd'
…
for c in ${@:-$archivers}; do
    echo $c
    case $c in
        szip   ) args='< "$input" > $output' ;;
        rzip   ) args='-k -o $output "$input"' ;;
        brotli ) args='-6 -c "$input" > $output' ;;
        *      ) args='-c "$input" > $output'
    esac

    eval "time -p $c $args" 2>&1 | awk '/real/ {print $2}'
    osize=`wc -c < $output`

    echo $isize $osize | awk '{print 100*(1-$2/($1==0?$2:$1))}'
    echo $osize
    rm $output
done | xargs -n4 printf "%-8s  %11.2f  %6.2f  %15d\n"

  • Not every archive tool has a gzip-compatible CLI.
  • We use the default compression level for each tool, with the exception of brotli, whose default level 11 is excruciatingly slow.
  • szip is an interface to the Snappy algorithm. Your distro probably doesn't have it in its repos, hence cargo install szip. Everything else should be available via dnf/apt.

Bar charts are generated by a gnuplot script:

$ cat plot.gp
$data <<E
#include "/dev/stdin"
E
set key tmargin
set xtics rotate by -30 left
set y2tics
set ylabel "Seconds"
set y2label "%"
set style data histograms
set style fill solid
plot $data using 2 axis x1y1 title "Time", \
     "" using 3:xticlabels(1) axis x1y2 title "Space saving"

Here is where the C preprocessor comes in handy: without an injected "datablock" it's impossible to draw a graph with 2 ordinates when reading data from stdin, for the plot command has to read the data twice & a pipe can't be rewound.
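
To see what gnuplot actually receives, drop the last stage of the pipeline: cpp splices the sorted table into the datablock (output truncated):

$ sort -nk3 table | cpp -P plot.gp
$data <<E
szip             0.59   56.98        126593557
lzop             0.67   57.14        126121462
compress         3.57   57.45        125217137
gzip             9.21   72.70         80335332
…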

In an attempt to demonstrate that xz is not always the best choice, I benchmarked a bunch of XML files (314MB):

$ ./comprtest ~/Downloads/emacs.stackexchange.com.tar
szip             0.59   63.70        119429565
gzip             7.18   77.59         73724710
compress         4.03   67.17        108015563
bzip2           21.37   83.36         54751478
rzip            17.42   85.93         46304199
lzip           119.70   85.06         49151518
lzop             0.67   63.63        119667058
xz             125.80   85.55         47559464
brotli          13.56   82.52         57509978
zstd             1.07   79.40         67766890

Disk images as archive file formats

As a prank, how do you create an archive in Linux that ⓐ cannot be opened in Windows (without WSL2 or Cygwin), ⓑ can be opened in macOS or FreeBSD?

Creating a .cpio or a .tar.xz won't cut it: file archivers such as 7-Zip are free & easy to install. Furthermore, sending an ext4 image, generated as follows:

$ truncate -s 10M file.img
$ mkfs.ext4 file.img
$ sudo mount -o loop file.img /somewhere
$ sudo cp something /somewhere
$ sudo umount /somewhere

doesn't help nowadays, for 7-Zip opens them too¹. Although disk cloning utils like FSArchiver can produce an image file from a directory, they are exclusive to Linux.

It boils down to this: which filesystems can be read across Linux/macOS/FreeBSD that Windows file archivers don't recognise? This rules out fat/ntfs/udf, for they are too common, & f2fs/nilfs2, for they are Linux-only.

The only viable candidate I found is XFS. Btrfs was a contender, but I'm unsure how to mount it on Mac.

Below is a script to automate the creation of prank archives. It takes any zip/tar.gz (or anything else that bsdtar is able to parse) & outputs an image file in the format specified by the output file extension:

sudo ./mkimg file.zip file.xfs

It requires sudo, for mount -o loop can't be done under a regular user.

#!/bin/sh

set -e

input=$1
output=$2
type=${2##*.}
[ -r "$input" ] && [ "$output" ] && [ "`id -u`" = 0 ] || {
    echo Usage: sudo mkimg file.zip file.ext2 1>&2
    exit 1
}
mkfs=mkfs.$type
cmd() { for c; do command -v $c >/dev/null || { echo no $c; return 1; }; done; }
cmd bsdtar "$mkfs"

cleanup() {
    set +e
    umount "$mnt" 2>/dev/null
    rm -rf "$mnt" "$log"
    [ "$ok" ] || rm -f "$output"
}

trap cleanup 0 1 2 15
usize=`bsdtar tvf "$input" | awk '{s += $5} END {print s}'`
mnt=`mktemp -d`
log=`mktemp`

case "$type" in
    msdos|*fat) size=$((1024*1024 + usize*2)); opt_tar=--no-same-owner ;;
    ext*|udf  ) size=$((1024*1024 + usize*2)) ;;
    f2fs      ) size=$((1024*1024*50 + usize*2)) ;;
    btrfs     ) size=$((114294784 + usize*2)) ;;
    nilfs2    ) size=$((134221824 + usize*2)) ;;
    xfs       ) size=$((1024*1024*300 + usize*2)) ;;
    jfs       ) size=$((1024*1024*16 + usize*2)); opt=-q ;;
    hfsplus   )
        size=$((1024*1024 + usize*2))
        [ $((size % 4096)) != 0 ] && size=$((size + (4096-(size % 4096)))) ;;
    *) echo "$type is untested" 1>&2; exit 1
esac
rm -f "$output"
truncate -s $size "$output"
$mkfs $opt "$output" > "$log" 2>&1 || { cat "$log"; exit 1; }

mount -o loop "$output" "$mnt"
bsdtar -C "$mnt" $opt_tar --chroot -xf "$input"
[ "$SUDO_UID" ] && chown "$SUDO_UID:$SUDO_GID" "$output"
ok=1
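
A sample run (the archive name is made up; file(1) should recognise the result):

$ sudo ./mkimg emacs.zip emacs.xfs
$ file emacs.xfs
emacs.xfs: SGI XFS filesystem data …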

.xfs files start at a size of 300MB, even if you put a single 0-length file in them, but bzip2 compresses such a mostly empty image down to 6270 bytes.

To mount an .xfs under a regular user, use libfsxfs.
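
A minimal sketch, assuming libfsxfs was built with its FUSE tool (check the project's docs for the exact binary name & argument order; the mount is read-only):

$ mkdir ~/mnt
$ fsxfsmount file.xfs ~/mnt
$ ls ~/mnt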


  1. 7z -i prints all supported formats.

Home streaming & inetd-style servers

The easiest way to stream a movie is to serve it using a static HTTP server that supports range requests. For this, even Ruby's WEBrick will do the job. Type this in a directory with your collection of The Sopranos:

$ ruby -run -ehttpd . -b 127.0.0.1 -p 8000

& point mpv or vlc to a particular episode:

$ mpv http://127.0.0.1:8000/s01e01.mp4

This should work as if you were playing a local file. To play a movie with a web browser, make sure the web server returns the correct Content-Type header. The container format matters too: e.g., Chrome doesn't like mkv.
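
Checking takes one request (the exact value depends on the server's MIME table):

$ curl -sI http://127.0.0.1:8000/s01e01.mp4 | grep -i content-type
Content-Type: video/mp4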

Can we do something similar without the HTTP server? Depending on the container format, it's possible to feed mpv with a raw TCP stream. We'll lose seeking, but if we were creating, say, a YouTube Shorts or Facebook Reels competitor, this wouldn't matter, for consumers of this kind of clip don't care much about seeking.

The most primitive solution requires only 2 utils:

  1. ncat, which can listen on a socket & fork an external program whenever someone connects to it:

    $ cat mickeymousetube
    #!/bin/sh
    
    export movie="${1:?Usage: ${0##*/} file.mkv [port]}"
    port=${2:-61001}
    type pv ncat || exit 1
    
    __dirname=$(dirname "$(readlink -f "$0")")
    ncat -vlk -e "$__dirname/pv.sh" 127.0.0.1 $port
    
  2. pv, the famous pipe monitor that can limit a transfer rate; without the limiter, mpv eats all available bandwidth:

    $ cat pv.sh
    #!/bin/sh
    pv -L2M "$movie"
    

    The -L2M option means max 2MB/s.

Then run mickeymousetube in one terminal & mpv tcp://127.0.0.1:61001 in another to play a clip.

tcplol

How hard can it be to replace ncat with a custom script of our own? What ncat does with the -e option is akin to what inetd did back in the day:

Steps performed by inetd

(The illustration is from Stevens' UNIX Network Programming.)

Instead of creating a server that manages sockets, one writes a program that simply reads from stdin and outputs to stdout. All the intricacies of properly handling multiple clients are managed by the super-duper-server.

There is no (x)inetd package in modern distros like Fedora, as systemd has superseded it with socket activation.

Suppose we have a script that asks a user for his nickname & greets him in return:

$ cat hello.sh
#!/bin/sh
uname 1>&2
while [ -z "$name" ]; do
    printf "Nickname? "
    read -r name || exit 1
done
echo "Hello, $name!"

To expose it to a network, we can either write 2 systemd unit files & place them in ~/.config/systemd/user/, or opt for a tiny 37 LOC Ruby script instead:

require 'socket'

usage = 'Usage: tcplol [-2v] [-h 127.0.0.1] -p 1234 program [args...]'
…

server = TCPServer.new opt['h'], opt['p']
loop do
  client = server.accept
  cid = client.remote_address.ip_unpack.join ':'

  warn "Client #{cid}"
  pid = fork do
    $stdin.reopen client
    $stdout.reopen client
    $stderr.reopen client if opt['2']
    client.close
    exec(*ARGV)
  end
  client.close
  Thread.new(cid, pid) do
    Process.wait pid
    warn "Client #{cid}: disconnect"
  end
end

This is a classic fork server that uses a thread for each fork to watch out for zombies. The linked tcplol script performs an additional clean-up in case the server gets hit with a SIGINT, for example.

ncat, on the other hand, operates quite differently:

  1. it creates 2 pipes;
  2. after each new connection, it forks itself;
  3. it connects the 2 pipes to the child's stdin/stdout;
  4. (in the parent process) it listens on a connected socket using select(2) syscall and transfers data to/from the child using the 2 pipes; we'll talk about select(2) and the concept of multiplexing later on.

Anyhow, if we run our much simpler "super-server":

$ ./tcplol -v -p 8000 ./hello.sh

& connect to it with 2 socat clients, the process tree under Linux would look like:

$ pstree `pgrep -f ./tcplol` -ap
ruby,259576 ./tcplol -v -p 8000 ./hello.sh
  ├─hello.sh,259580 ./hello.sh
  ├─hello.sh,259587 ./hello.sh
  ├─{ruby},259583
  ├─{ruby},259588
  └─{ruby},259589

The dialog:

$ socat - TCP4:127.0.0.1:8000
Nickname? Dude
Hello, Dude!

(Why socat? We can use ncat as well, but the latter doesn't close its end of a connection; it hangs in CLOSE_WAIT until one presses Ctrl-D.)

To play a movie, run

$ ./tcplol -v -p 8000 ./pv.sh file.mkv

using a modified version of pv.sh script:

#!/bin/sh
echo Streaming "$1" 1>&2
pv -L2M "${1?Usage: pv.sh file}"

Then connect to the server with

$ mpv tcp://127.0.0.1:8000

Mickey mouse SOCKS4 server

inetd-style services can perform various actions, not just humbly write to stdout. Nothing prevents such a service from opening a connection to a different machine and relaying bytes from it to the tcplol clients.

To illustrate the perils of the low-level socket interface, let's write a crude, allow-everyone socks4 service and test it with curl. The objective is to retrieve the security.txt file from Google over a TLS connection, like so:

$ curl -L https://google.com/.well-known/security.txt --proxy socks4://127.0.0.1:8000

As a socks4 client, curl sends a request to 127.0.0.1:8000 with an IP+port to which it wants our service to establish a connection (meaning we don't have to resolve the google.com domain name ourselves). We decode this and promptly send an acknowledgment reply. This is the 1st part of socks4.rb, which we are going to run under tcplol:

$stdout.sync = true

req = $stdin.read 8 + 1                    # 8-byte request + the NUL of an empty USERID
ver, command, port, ip = req.unpack 'CCnN' # the 1st 8 bytes
abort 'Invalid CONNECT' unless ver == 4 && command == 1

ip = ip.to_s(16).rjust(8, '0').scan(/.{2}/).map(&:hex) # [a,b,c,d]; pad, or IPs with a small 1st octet break
res = [0, 90].pack('C*') +               # request granted
      [port].pack('n') + ip.pack('C*')
$stdout.write res
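
For instance, had curl resolved google.com to 142.250.74.46 (an address picked purely for illustration), the 9 bytes it sends & the 8 bytes we reply with would be:

04 01 01 bb 8e fa 4a 2e 00    VN=4, CONNECT, port 443, IP, empty USERID + NUL
00 5a 01 bb 8e fa 4a 2e       VN=0, 90 = request granted, port & IP echoed back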

What should we do next? As soon as curl gets the value of the 'res' variable, it eagerly starts sending a TLS ClientHello message to 127.0.0.1:8000. At this point, we don't need to analyse exactly what it sends--our primary concern is relaying traffic to and fro as quickly as possible without losing bytes.

To temporarily test that we have correctly negotiated SOCKS4, we can conclude the script with an ncat call:

exec "ncat", "-v", ip.join('.'), port.to_s

It should work. However, we can also rewrite that line in pure Ruby using the Kernel.select method. What we need here is to monitor 2 file descriptors in different modes to react to changes in their state:

  1. in reading mode: stdin and a TCP socket to google.com;
  2. in writing mode: the TCP socket to google.com.

(We assume that stdout is always available.) This kind of programming--being notified when an IO connection is ready (for reading or writing, for example) on a set of file descriptors--is called IO multiplexing. Most web programmers never encounter it because the socket interface is many levels below the stack they are working in, but it may be interesting sometimes to see how the sausage is made.

Replace exec "ncat" line with:

require 'socket'

s = TCPSocket.new ip.join('.'), port
wbuf = []
BUFSIZ = 1024 * 1024
loop do
  sockets = select [$stdin, s], [s], [], 5
  break unless sockets          # nil after 5 seconds of inactivity: give up

  sockets[0].each do |socket|   # readable
    if socket == $stdin
      input = $stdin.readpartial BUFSIZ
      wbuf << input
    else
      input = socket.readpartial BUFSIZ
      $stdout.write input
    end
  end

  sockets[1].each do |socket|   # writable
    wbuf.each { |v| socket.write v }
    wbuf = []
  end
end

We establish a connection to google.com and then monitor the two file descriptors in an endless loop. The select method blocks until at least one of them becomes available for reading or writing. The last argument to it is a timeout in seconds: if nothing happens within 5 seconds, select returns nil & we break out of the loop.

When select unblocks, sockets[0] contains an array of file descriptors available for reading. If one of them is stdin, we read whatever data the OS kernel thinks is obtainable & save the chunk to the wbuf array. If it is the socket to google.com, we read some bytes from it & immediately write them to stdout for curl to consume.

sockets[1] contains an array of file descriptors available for writing. The only one we monitor is the google.com socket, to which we flush the contents of the wbuf array.

The script terminates when $stdin.readpartial raises an EOFError. This indicates to curl that the other party has closed its connection.

If you run socks4.rb under tcplol:

./tcplol -v -p 8000 ./socks4.rb

and watch the diagnostics tcplol prints, you'll see that curl makes 2 requests to google.com, for the 1st one yields a 301.

$ curl -sLI https://google.com/.well-known/security.txt --proxy socks4://127.0.0.1:8000 | grep -E '^HTTP|content-length'
HTTP/2 301
content-length: 244
HTTP/2 200
content-length: 246
