Design of a Python to Postscript compiler

NOTE: I was learning Postscript during the vacation for Chinese New Year. This was meant to be completed in 2 weeks but I heavily underestimated the workload as I always do (and I'd really like to focus on other things now). This blogpost mainly consists of short notes from the thought process. Maybe one day I'll work on this idea again and hopefully produce a working compiler.

Supported subset of Python

I can't possibly adapt the whole python into postscript (even if it's possible it'll be way too much work) so this subset will have to do.

-- this is originally taken from Python's documentation:
--     https://docs.python.org/3/library/ast.html#abstract-grammar

module Pyps
{
    mod = Module(stmt* body)
        -- normally this means it's entered thru a REPL. but i probably
        -- don't have the time to make a repl so... :|
        | Interactive(stmt* body)

    stmt = FunctionDef(identifier name, arguments args,
                stmt* body,
                expr? returns)

         | Return(expr? value)
         | Delete(expr* targets)
         | Assign(expr* targets, expr value)
         | AugAssign(expr target, operator op, expr value)
         | AnnAssign(expr target, expr annotation, expr? value)
         | If(expr test, stmt* body, stmt* orelse)
         | While(expr test, stmt* body, stmt* orelse)
         | For(expr target, expr iter, stmt* body, stmt* orelse)

         -- i'm not going to add any error handling stuff right now because i
         -- haven't quite figured out how things work in postscript yet

         | Pass
         | Break
         | Continue
    
    expr = BoolOp(boolop op, expr* values)
         | BinOp(expr left, operator op, expr right)
         | UnaryOp(unaryop op, expr operand)
         | IfExp(expr test, expr body, expr orelse)
         | Dict(expr* keys, expr* values)
         | ListComp(expr elt, comprehension* generators)
         | DictComp(expr elt, comprehension* generators)
         | Compare(expr left, cmpop* ops, expr* comparators)
         | Call(expr func, expr* args, keyword* keywords)
         -- NOTE: slice, when applied on arrays, will be considered as the
         -- equivalent to "making a new array & fill it with for-loop`, and
         -- is handled the same way.
         | Subscript(expr value, expr slice)
         | Name(identifier id)
         | List(expr* elts)
         | Tuple(expr* elts)
        
    boolop = And | Or

    operator = Add | Sub | Mult | MatMult | Div | Mod | Pow | LShift
                 | RShift | BitOr | BitXor | BitAnd | FloorDiv

    unaryop = Invert | Not | UAdd | USub

    cmpop = Eq | NotEq | Lt | LtE | Gt | GtE | Is | IsNot | In | NotIn
    
    comprehension = (expr target, expr iter, expr* ifs, int is_async)

    -- could have to extra work to support keyword args but i'll leave it
    -- out for now.
    arguments = (arg* args)

    arg = (identifier arg, expr? annotation)
}

Type inference

In order to compile python into postscript correctly a certain degree of type inference is needed (see below, "General compiling scheme").

General compiling scheme

Handling expressions

Normally we can translates simple expressions to postscript in a bottom-up manner. For example (lowest + highest) / 2 in python can be translated into lowest highest add 2 div in postscript. Other constructs (listed below) are not that easy.

Lists, tuples and dicts literals

Python lists can be translated into a postscript dict with a single value field. The reason for this is explained below, in the "Operations on lists & dicts" section.
- e.g. [3, 4, 5] will become << /value [ 3 4 5 ] cvlit >>.
Python tuples can be translated into postscript packedarrays, then cvlit and readonly is added.
- e.g. (3, "abc") will become { 3 (abc) } cvlit readonly.
Python dicts can be translated into postscript dicts with << >> and name objects.
- e.g. {"abc": 3, "def": 4} will become << /abc 3 /def 4 >> cvlit.
- note that in python you can use expressions as keys in dict literals, e.g. executing a = "abc"; b = {a: 3} will assign b with {"abc": 3}. this is done in postscript with cvn. for this example, the translation result for the {a: 3} part will be << a cvn 3 >> cvlit.

List and dict comprehensions are discussed later.

Operations on lists & dicts

There are no direct equivalent of OOP objects & method calls in postscript and it definitely isn't worth it to implement OOP in postscript (we already have enough "oop" lang bullshit now please leave postscript alone); but whether we implement oop constructs or not we'll have to manage the operations ourselves.

Take list.append in python for an example. Arrays are fixed size in postscript so there's no list.append; to append something to an array we have to create a new one with a bigger size, but this will break reference, so python lists are translated into an explicit reference of an array implemented with a postscript dict.

for lists:

A1 + A2: note that in python adding two lists creates new list, so this is okay:

<< /value
    A1 /value get aload pop
    A2 /value get aload pop
    A1 /value get length
    A2 /value get length
    add array astore
>>

A.append(B):

% setup for later `put`
A

% unpack the original array
A /value get aload

% create a new array
dup length 1 add array

% delete the original array
exch pop

% creates the new array
B exch astore

% put it back into the reference
/value exch put

A.pop():

A A /value get aload
dup length 1 sub array
exch pop exch pop
astore
/value exch put

for dicts:
- A[B] and A.get(B):
  The difference between A[B] and A.get(B) in Python is that the former throws an error when B does not exist in A while the latter simply returns None. This issue is easy to solve: the latter is essentially equivalent to None if B not in A else A[B] and can be translated as such. No, the real issue is in generating the code for the keys - specifically numeric keys. In postscript one cannot directly cvn a number, in order to turn it into a name object one must first turn it into a string; but postscript strings are fixed-length & have to be pre-allocated, so one should preferrably knows how long the string representation of the said number would be beforehand (which when it's dynamic it's impossible) - or one can just allocate one big long string as the buffer because postscript cvs truncates the string for you.
  
  In the simple case of string keys, the former can simply be translated into A B cvn get because postscript get do throw an error when the key does not exist in dict. For the latter, as mentioned before, can be wrapped in an ifelse with a known test:
```
A B cvn known { null } { A B cvn get } ifelse        
```
- A.keys():
  Postscript forall puts the key then the value for every key-value pair in dict. If these two objects are not discarded from the stack, they'll remain throughout the forall; we can thus keep only the key objects and put them into an array:
```
A { pop } forall A length array astore        
```
- A.items():
  Same as above with a different forall proc - instead of discarding the value object we pack the key and the value into a 2-array:
```
A { 2 array astore } forall A length array astore        
```
- A.pop(B):
```
A B cvn undef        
```

Strings & byte-strings

Postscript only has what should probably be called byte-strings; a simple way to solve this issue is to translate all string literals into big-endian utf32 (e.g. "你好世界", which is U+4F60, U+597D, U+4E16 and U+754C, will be translated into (\000\000\117\140\000\000\131\175\000\000\116\026\000\000\165\114)) and multiply or divide the lengths & indices by 4 as we need.

Operations on strings & byte-strings

The translations of string operations and byte-string operations are the same as long as they don't invole lengths & indices. Possible implementation of a few operations are listed below.

A1 + A2:

A1 length A2 length add string
dup 0 A1 putinterval
dup A1 length A2 putinterval

A.join(B):

there are probably a lot of solutions more simple than this mess.

B length 0 eq { () } {
    B length 1 eq { B 0 get } {
        B 0 get
        1 1 B length 1 sub {
            2 dict begin
                    /i exch def
                    /prev exch def
                prev length A length add B i get length add string
                dup 0 prev putinterval
                dup prev length A putinterval
                dup prev length A length add B i get putinterval
            end
        } for
    } ifelse
} ifelse

len(A):

A length
% if A is infered to be a string (rather than byte-string), `4 div` is inserted after.

Subscripts & slices of (byte-)strings, tuples & arrays

Like mentioned above, postscript only has byte-strings. If we do encode unicode strings with big-endian utf32, then String[Start:End] can be translated into String Start 4 mul End 4 mul Start 4 mul sub getinterval. The translation for byte-string & array subscripts & slices are the same because getinterval is polymorphic: A[Start:End] will simply become A Start End Start sub getinterval.

When there's a Step part, one will have to fallback to the old "append-loop" method, i.e. for any A[Start:End:Step] it's equivalent to this:

# for arrays.
# some variable representing the stack top.
_ = []
for i in range(Start, End, Step):
    _.append(A[i])

# for bytestrings.
_ = b''
for i in range(Start, End, Step):
    _ = _ + A[i]

which is then translated like a normal for loop with range (discussed later).

Calls

A call like Func(arg0, arg1, ..., argn) is simply translated into arg0 arg1 ... argn Func when Func is an identifier. When it's not an identifier (e.g. something like some_dict["some_key"](args); some people do like their table-driven method in python) chances are a packed array representing Func will be pushed onto the stack rather than actually being executed, so the general way is translating it into arg0 arg1 ... argn Func exec.

Handling assignment statements

When the lvalue is an identifier:

Variable = Expr

it can be simply translated into this:

/Variable
    % insert whatever the heck `Expr` translates into
def

So-called "augmenting assignment" is similar: A += B is simply A = A + B and can be translated as if it's A = A + B.

When the lvalue is a subscript, i.e. something like A[B] = Expr, the translation depends on the type of A: it can be translated into A B Expr put (if A is an array) or A B cvn Expr put (if A is a dict).

Handling function definitions

The way to create a "local scope" in postscript is to create a new dict (with dict) and push it onto/pop it from the dict stack (done by begin and end). e.g. the factorial function can be defined in postscript as:

% fac: [n] --- [n!]
/fac {
    % create a dict and use it as a scope
    1 dict begin

    % define `n` in said scope
    /n exch def

    % the main part
    n 1 le
        { 1 }
        { n n 1 sub fac mul }
    ifelse

    % exit the scope
    end
}

There's an extra detail needs to be taken care of:

In postscript the size of dict when created is fixed. (note that dict takes a "size" parameter)
In postscript languagelevel 1, when this size is exceeded, a dictfull error will occur. Even though ghostscript (the current go-to implementation for postscript) supports up to languagelevel 3, I decide it's better to be aware of this.

This requires knowing all the variable names used within a function definition (which is actually trivial). Function definitions are thus translated into postscript like the one shown below:

def F(a0, a1, ..., an):
    do_things()

/F {
    % where "!n" is the number of identifiers directly
    % used as a lvalue inside the function (including the parameters)。
    !n dict begin

    % the parameter names are in reverse order because how things work in postscript...
    [/an ... /a1 /a0] {exch def} forall

    % do_things
    do_things

    end
} def

Handling `if`

Single clause if is trivial:

if A:
    Body

A { Body } if

The translations of if-else statement and if-else expression are exactly the same; the only difference is whether there'll be result on the stack after the execution:

Value1 if A else Value2

# or:
if A:
    Body1
else:
    Body2

A { Value1 } { Value2 } ifelse

% or:
A { Body1 } { Body2 } ifelse

if-elif-else is simply a series of stacking if-else, and is converted as such:

if A.1:
    Body1
elif A.2:
    Body2
elif A.3:
    Body3
else:
    Body4

# this is equivalent to:
if A.1:
    Body1
else:
    if A.2:
        Body2
    else:
        if A.3:
            Body3
        else:
            Body4

A.1 { Body1 }
{
    A.2 { Body2 }
    {
        A.3 { Body3 }
        {
            Body4
        } ifelse
    } ifelse
} ifelse

Handling `while`

Easy to notice simple while loops without continue like this:

while A:
    if A.1:
        break
    else:
        do_work()

can be translated into postscript like this:

{
    A not { exit } if
    
    A.1 { exit } { do_work } ifelse
} loop

When continue is involved it's a little bit trickier. See below.

Handling `break` and `continue`

break and continue often appears within the same while loop like this:

while A:
    if A.1:
        break
    elif A.2:
        continue
    else:
        do_work()

To summarize we have:

loop (done by while and for)
a way to exit the loop completely (done by break)
a way to prematurely proceed to the next iteration (done by continue)

But in postscript there's no directly similar construct as continue; instead we have:

loop (done by for, forall, loop and other looping constructs.)
a way to exit the loop (done by exit)
another way to exit the loop while kind of signifying an error like raising an exception (done by stop)
a way to detect such exit (done by stopped)

This can be demonstrated with the following example: exiting by exit is not captured by stopped, while exiting by stop, is.

{ { exit } loop } stopped =  % prints `false`
{ { stop } loop } stopped =  % prints `true`

Easy to see continue as a special kind of exit that the execution flow will go back into the loop again upon exit, and the action of "going back into the loop" can be done by wrapping another layer of loop. So for the python example above we can construct this:

{
    % while-condition not met. exit immediately.
    A not { exit } if

    % this whole `loop-stopped-if` block will exit the *outer* loop when
    % executing the program fragment inside lead to a `stop` being executed.
    {

        % this loop is necessary to execu
        {
            % while-condition not met. exit immediately.
            % note that this is the while-condition; we convert this:
            %     while A:
            %         ...
            % into this here:
            %     while True:
            %         if not A: break
            %         ...
            A not { stop } if

            % "if A.1: break"
            A.1
                { stop } 
                {
                    % "elif A.2: continue" and "else: do_work()"
                    A.2
                        { exit }
                        { do_work }
                    ifelse
                }
            if
        } loop
    } stopped
        % we exit the loop here because `stop` (which in terms corresponds
        % to a `break`) was executed. if `exit` is executed, it'll continue
        % to execute because there's the outer `loop`.
        { exit } if
} loop

Handling `for`

Here's a few thing that needs to be careful about python's for:

there's binding involved. e.g. for i in iter binds value to i.
the binding is not local to the for block.

So the following python program works perfectly fine:

# oh yes the `_` is not a "special" wildcard identifier either - it's just
# a normal variable name like any other variable names.
total_sum = 0
for i, _ in enumerate(range(10)):
    pass

print(i, _)
# prints "9 9"

`for` with `range`

Note that postscript for is "inclusive", i.e. when the looping variable (the i part in for i in iter) is exactly equal to the end value, the body is also executed, so while you calculate Sum(i=1→100) in python like this:

total_sum = 0
for i in range(1, 101):
    total_sum += i

in postscript it's this:

/total_sum 0 def
% note that it's 100 not 101
1 1 100 { total_sum add /total_sum exch def } for

The actual End for postscript is a bit tricky to calculate; the easy way out is to once again using the equivalent while. So the program:

for i in range(Start, End, Step):
    Body

will be seen as:

i = Start
while i < End:
    Body
    i += Step

and thus results in:

/i Start def
{
    i End lt not { exit } if
    {
        i End lt not { stop } if
        Body
        /i i Step add def
    } stopped { exit } if
} loop

`for` with `enumerate`

for with enumerate is a different story:

Unlike range, this time we have no way to manually determine how to loop, which means we'll have to use forall.
In python, an iterator of a dict is an iterator of that dict's keys, which means we'll have to (1) insert a pop in the front of the generated loop body whenever the iterator is a dict, or (2) generate the array of keys before hand, and in order to do it correctly in compile-time both method needs some form of type inference.
- which means (2) is extra work no matter what, so we'll ditch that solution.

So a for loop like this:

for Index, Value in enumerate(Iter):
    Body

can be translated into this:

% we also have to manage the index on our own.
/Index 0 def
{
    % if `Iter` is a dict then insert `pop` here.
    /Value exch def
    /Index Index 1 add def
} forall

`for` with other iterators

The translation of this kind of for is similar to the translation of for with enumerate.

Comprehensions

Comprehensions can be done in a similar way of for. The list comprehension form [Expr(Var) for Var in Iter if Cond(Var)] is equivalent to:

_ = []
for Var in Iter:
    if Cond(Var):
        _.append(Expr(Var))

List comprehension forms with multiple for-clause, e.g. [Expr(Var1, Var2) for Var1 in Iter2 if Cond1(Var1) for Var2 in Iter2 if Cond2(Var2)] is equivalent to:

_ = []
for Var1 in Iter1:
    if Cond1(Var1):
        for Var2 in Iter2:
            if Cond2(Var2):
                _.append(Expr(Var1, Var2))

Dict comprehensions are similar but one faces the same numeric key problem as translating dict subscript (mentioned above). {Key(i): Value(i) for i in Iter if Cond(i)} is equivalent to:

_ = {}
for i in Iter:
    if Cond(i):
        _[Key(i)] = Value(i)

and thus can be translated as such.