Extracting Bitmaps From Plan 9 Fonts

Update 2022.7.17: To turn a font (e.g. a .ttf) into Plan 9 subfonts, you need to:

To combine two (or more) Plan 9 subfonts that share the same range, you have to:

  1. retrieve the bitmaps from all the subfonts;
  2. somehow splice them together into one single bitmap;
  3. convert this bitmap into a new subfont file.
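
Step 2 can be sketched roughly like this. It is only a sketch under my own assumptions: the glyphs are held as a dict from code point to a list of '0'/'1' strings (a made-up in-memory form, not anything Plan 9 defines), and `splice_glyphs` is a hypothetical helper name. It lays the glyphs out side by side and records the x offset of each one, with one extra entry at the end.

```python
def splice_glyphs(glyphs_a, glyphs_b, first, last, height):
    """Lay out the glyphs for code points first..last side by side.

    glyphs_a / glyphs_b map a code point to a list of `height` strings
    of '0'/'1' (hypothetical in-memory form). Returns (rows, xs), where
    xs has one x offset per code point plus a final end-of-bitmap entry.
    """
    merged = {**glyphs_b, **glyphs_a}  # A wins if both define a glyph
    rows = [''] * height
    xs = []
    for cp in range(first, last + 1):
        xs.append(len(rows[0]))        # left edge of this glyph
        glyph = merged.get(cp)
        if glyph is None:
            continue                   # missing glyph: zero-width slot
        for y in range(height):
            rows[y] += glyph[y]
    xs.append(len(rows[0]))            # extra entry marking the right edge
    return rows, xs


a = {0x2000: ['10', '01']}
b = {0x2001: ['111', '000']}
rows, xs = splice_glyphs(a, b, 0x2000, 0x2002, 2)
print(rows, xs)
```

Turning `rows` and `xs` back into an actual subfont file (step 3) is not covered by this sketch.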

For example, assume font A has glyphs for 0x2000, 0x2002, ..., 0x2008 and font B has glyphs for 0x2001, 0x2003, ..., 0x2009. Normally the subfont of A will only contain the glyphs for 0x2000, 0x2002, etc., and the subfont of B will only contain the glyphs for 0x2001, 0x2003, etc. In this case, if both subfonts are used for the range 0x2000~0x2009 in their respective .font files, directly replacing this entry in the .font of A with the one from B will add the glyphs from B but remove the glyphs from A. To actually make use of both sets of glyphs, you can either:

This has been on my todo list for almost a year. I wanted to do this because this gb/ font (noted "script fonts?" on the wiki page) is (1) actually a simplified Chinese font that supports the GB2312 standard (probably the reason for the folder name gb) and (2) quite an old font. I've only seen it once, *ages* ago, used by a weird-ass text editor with a built-in special input method. I have absolutely zero idea how the font got into Plan 9 (if you know, please let me know).

I thought this was gonna take days because I'd need to painfully read through pure *nix-flavour (?) manual pages, but it turns out it can be done in one afternoon. Why the heck did I keep this on my todo list for this long..??.?

The font files can be found here. Here is a part of the bitmap I've extracted:

The format of Plan 9 font files

A "font" in Plan 9 is made up of many "subfonts" because:

A Font may contain too many characters to hold in memory simultaneously.

-- cachechars(2) from Plan 9 manual pages

A subfont file contains the bitmap data for a range of Unicode code points. Its file name specifies the range or the start of the range. For example, the first row of the bitmap above comes from the file named Song.4e00.16, which means this file contains the data for U+4E00~ of the 16px Song font. The exact range is specified by the .font index file.
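
To make the index part concrete, here is a minimal sketch of reading a .font file. It assumes the layout as I understand it from font(6): a first line with the height and ascent, then one line per subfont giving the min and max code points, an optional offset, and the subfont file name. `parse_font_index` is a name I made up, and the sample contents below are invented for illustration (only the name Song.4e00.16 comes from the real files).

```python
def parse_font_index(text):
    """Parse the text of a .font index file (layout assumed from font(6))."""
    lines = [ln.split() for ln in text.splitlines() if ln.strip()]
    height, ascent = int(lines[0][0]), int(lines[0][1])
    entries = []
    for fields in lines[1:]:
        # base 0 accepts both 0x-prefixed hex and plain decimal
        lo, hi = int(fields[0], 0), int(fields[1], 0)
        # the offset field is optional; assume 0 when it is absent
        offset = int(fields[2], 0) if len(fields) == 4 else 0
        entries.append((lo, hi, offset, fields[-1]))
    return height, ascent, entries


# Made-up sample contents, not copied from a real .font file.
SAMPLE = """16 13
0x0000 0x00ff latin1.16
0x4e00 0x4fff Song.4e00.16
"""

height, ascent, entries = parse_font_index(SAMPLE)
for lo, hi, offset, name in entries:
    print(f'U+{lo:04X}..U+{hi:04X} -> {name}')
```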

A subfont file is made up of three parts:

When displaying the bitmap for the character c at point p:

The glyphs are stored as a Plan 9 image, whose format (see image(6)) is as follows:

The decompression of the blocks goes as follows.

With all this knowledge we can finally write the code:

import sys

def main(filename):
    with open(filename, 'rb') as f:
        s = f.read()

    if s.startswith(b'compressed\n'):
        print('compressed.')
        s = s[len(b'compressed\n'):]
        compressed = True
    else:
        print('not compressed.')
        compressed = False

    header1 = s[:12*5].decode('utf-8')
    s = s[12*5:]
    print('chan', header1[:12])
    print('r.min.x', header1[12:24])
    print('r.min.y', header1[24:36])
    print('r.max.x', header1[36:48])
    print('r.max.y', header1[48:])

    rminy = int(header1[24:36].strip())
    rmaxx = int(header1[36:48].strip())
    rmaxy = int(header1[48:].strip())

    if compressed:
        miny = rminy
        code_word_list = []
        # for each block:
        while miny < rmaxy:
            raw_block_header = s[:2*12]; s = s[2*12:]
            maxy = int(raw_block_header[:12].decode('utf-8'))
            nb = int(raw_block_header[12:].decode('utf-8'))
            # extract code words from block
            raw_block_data = s[:nb]; s = s[nb:]
            i = 0
            raw_block_data_len = len(raw_block_data)
            while i < raw_block_data_len:
                if raw_block_data[i] >= 128:
                    this_word_len = raw_block_data[i] - 128+1
                    i += 1
                    code_word_list.append(raw_block_data[i:i+this_word_len])
                    i += this_word_len
                else:
                    a = (raw_block_data[i]&0b01111100)>>2
                    b = ((raw_block_data[i]&0b00000011)<<8)|(raw_block_data[i+1])
                    code_word_list.append((a, b))
                    i += 2
            miny = maxy
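        # decompress from code words
        res = bytearray()
        for code_word in code_word_list:
            if isinstance(code_word, bytes):
                # literal run: copy the bytes as-is
                res += code_word
            else:
                # back-reference: copy `length` bytes starting `off` bytes
                # back, one at a time so that overlapping copies work
                a, b = code_word
                length = a + 3
                off = b + 1
                for _ in range(length):
                    res.append(res[-off])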

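        # Write a plain PBM. This assumes a 1-bit-per-pixel image and a
        # byte-aligned r.min.x; each row is padded to a whole number of
        # bytes, so the padding bits are dropped.
        img_width = rmaxx - int(header1[12:24])
        img_height = rmaxy - rminy
        row_bytes = (img_width + 7) // 8
        with open(f'{filename}.pbm', 'w') as f:
            print(f'P1\n{img_width} {img_height}', file=f)
            for y in range(img_height):
                row = res[y*row_bytes:(y+1)*row_bytes]
                bits = ''.join(f'{k:08b}' for k in row)[:img_width]
                print(' '.join(bits), file=f)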

        n = int(s[:12].decode('utf-8'))
        height = int(s[12:24].decode('utf-8'))
        ascent = int(s[24:36].decode('utf-8'))
        s = s[36:]
        info_list = []
        for _ in range(n+1):
            raw_info = s[:6]
            x = raw_info[0]|(raw_info[1]<<8)
            top = raw_info[2]
            bottom = raw_info[3]
            left = raw_info[4]
            if left >= 128:  # left is a signed byte
                left -= 256
            width = raw_info[5]
            info_list.append((x, top, bottom, left, width))
            s = s[6:]

        with open(f'{filename}.info.txt', 'w') as f:
            print(f'n={n}, height={height}, ascent={ascent}', file=f)
            for x, top, bottom, left, width in info_list:
                print(f'x={x}, top={top}, bottom={bottom}, left={left}, width={width}', file=f)


if __name__ == '__main__':
    main(sys.argv[1])
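
Once you have the decoded bytes and the info list, cutting out a single glyph is just slicing. The following is a small sketch, assuming a 1-bit-per-pixel image with byte-padded rows, and assuming (as I read cachechars(2)) that glyph i occupies columns x[i] up to x[i+1] and scan lines top up to bottom. `bytes_to_rows` and `glyph_rows` are names I made up; `info_list` has the same tuple layout as in the script above.

```python
def bytes_to_rows(data, width, height):
    """Turn packed 1-bit image data into one '0'/'1' string per row."""
    row_bytes = (width + 7) // 8       # rows are padded to whole bytes
    rows = []
    for y in range(height):
        chunk = data[y * row_bytes:(y + 1) * row_bytes]
        rows.append(''.join(f'{b:08b}' for b in chunk)[:width])
    return rows


def glyph_rows(rows, info_list, i):
    """Cut glyph number i out of the bitmap rows."""
    x, top, bottom, left, width = info_list[i]
    next_x = info_list[i + 1][0]       # the next entry's x is the right edge
    return [row[x:next_x] for row in rows[top:bottom]]


# Tiny hand-made example: a 5x2 bitmap holding two glyphs.
rows = bytes_to_rows(bytes([0b10111000, 0b01000000]), 5, 2)
info_list = [(0, 0, 2, 0, 2), (2, 0, 1, 0, 3), (5, 0, 0, 0, 0)]
for line in glyph_rows(rows, info_list, 0):
    print(line)
```

The last entry in `info_list` is never a glyph of its own; it is only there to supply the right edge of the final one.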

Maybe one day I'll learn how TrueType works and compile a .ttf out of this... There's surprisingly quite a lot of stuff to learn.