PoML

=programming =computer =suggestion

 

 


This page is a proposal for PoML (Polymorphic ML, pronounced "pommel"), a variant/extension of OCaml that addresses several common objections to it. MLton and LLVM could be used for most of the compilation steps.


I ported some "bottles of beer" programs here to show what PoML would look like.



Why?


Type inference combining the benefits of static and dynamic languages has been a goal for a long time. So has mixing programming languages. One way of achieving both is compilation to heavily overloaded C++, using C++ as an intermediate language. This is the approach used by haxe and shedskin, but it has various issues, eg: slow compilation; poor readability of the compiled code, which makes interfacing with it hard; and still forcing compiler writers to deal with type inference and garbage collection themselves...


common complaints about OCaml that PoML handles:


other benefits vs OCaml:


disadvantages vs OCaml:



backwards compatibility:


PoML is designed to be almost backwards compatible with OCaml. It should be possible to convert most existing OCaml code automatically, and open a different standard library interface for OCaml sections. Converting objects to records + "maybe" functions would be easy enough.


The OCaml standard library could be adapted by recompiling external functions and renaming files. The new standard library could be a polymorphic interface to some combination of parts from the OCaml standard library, Batteries + other OCaml libraries, and interfaces to C(++) and FORTRAN libraries compiled to LLVM IR.



reserved keywords:


new: maybe, var, goto, mark, spawn, postmark, accept, alloc, fail, inline


removed: let, in, class, new, method, mutable, fun, function, begin, module, functor, sig

 

changed: match

 

 

syntax changes:

 

Statements end with " . " (surrounded by whitespace) instead of "in". This ending is now always mandatory instead of only sometimes needed; that means "let" is no longer needed to begin statements. This should make PoML seem more familiar to C/C++/Java/Python programmers, and it avoids visual confusion between "let" and variable names, since both are words. A compiler warning should be given when indentation decreases without the statement ending in " . ".

 

The above change means that "=" can no longer double as an equality operator. So, the equality operators are changed to match C ("==" and "!=").

 

"let _ =" and "let () =" are removed.

 

"<<" replaces ":=" and "<-" for assignment, because it's easier to type.

 

"&" is a universal concatenation operator; bitwise AND is "&&" instead, which also acts as boolean AND.

 

When "-" is preceded but not followed by whitespace, it's a unary negation with high priority.

 

"begin ... end" is removed.

 

Array and string indexing uses array[index] like C and C++ do. This is ambiguous with (function [list]) so lists should be enclosed with parentheses when preceded by something that could be a function, eg: function([list]) .

 

Complete statements are executed in the interpreter automatically, instead of being triggered by ";;". Instead, ";;" ends the most recent "if" statement, and "if" statements are no longer ended by a ";". This makes ( ) on "if" clauses unnecessary.

 

Infix operator definitions with a single input variable are prefix with high priority.

(++) x = x << x + 1 .

++x; increments x and returns ().



"maybe" polymorphism:


(+) = maybe add_int maybe add_float .


During compilation, the type signatures of add_int and add_float are matched against the type signatures of calls to +, and the first match is chosen for each call. This system could be used for implicit typecasting, but I don't think that's a good idea; my view is that type conversion can be implied but should always be unary. If no possibility matches a call, that's a compiler error: "[function] at [location] does not match [type signature of input]".
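Since PoML doesn't exist, here is an illustrative Python sketch of the intended resolution rule (all names here are hypothetical): each "maybe" alternative carries a type signature, alternatives are tried left to right, and the first whose signature matches the call's argument types wins; no match is an error.

```python
# Sketch of "maybe" overload resolution. PoML would do this statically
# at compile time; here it's a runtime dispatch for illustration.

def add_int(a, b): return a + b
def add_float(a, b): return a + b

# Each alternative pairs a type signature with an implementation,
# in the order they were written after "maybe".
plus_alternatives = [
    ((int, int), add_int),
    ((float, float), add_float),
]

def resolve(alternatives, args):
    # Try signatures left to right; first match wins.
    for sig, impl in alternatives:
        if len(sig) == len(args) and all(type(a) is t for a, t in zip(args, sig)):
            return impl
    # No match: the analogue of the compiler error described above.
    raise TypeError("no alternative matches %r" % (tuple(type(a) for a in args),))

plus = lambda a, b: resolve(plus_alternatives, (a, b))(a, b)

print(plus(1, 2))       # int alternative chosen
print(plus(1.5, 2.5))   # float alternative chosen
```

The same left-to-right rule is what makes the later "zero + zero" example deterministic: the first whitelisted signature is the one used.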


maybe f x = ...

This is equivalent to

f x = maybe ... maybe f x .

but it also works if f isn't currently bound. Here's an overloading of addition for vector math:

maybe (+) = map2 (+) .


In that example, map2 (lists, arrays) and (+) (ints, floats) are both "maybe" functions. We want the new function to be valid for every valid combination of those 2 functions. (In this case, all of them.) Types are established from the end of the code upwards, checking each function call for type validity and building a type signature whitelist for a function as it's applied. Possibilities should be tried from left to right, which means the first matching signature whitelisted is for the correct possibility. This whitelist is later used to determine which versions of functions to compile; that can also provide dead code elimination.


Once typechecking is complete, a function version is made for each whitelist entry, with inlining of entries that are just another function, and names are changed to the appropriate version's name. Then compilation proceeds as usual.


Now suppose we have:

zero = maybe 0.0 maybe 0 .

zero + zero .

print = maybe print_int maybe print_float .

print x;

Because we're not inside a "let" statement at the end, the possibilities are matched in order defined and from left to right. Here, print_int is the first possibility for print and it matches the int type possibility of x, so we stop there and do that.


Specifying a type signature calls a function only with a single specified signature, eg:

f1 a b = a,b .

f2 a b = b,a .

swap a b =

maybe f1 : int -> int -> int * int

maybe f2 : string -> string -> string * string

.


Modules can also be used with maybe, eg:

module ArrayList = maybe Array maybe List .

assigns a stack of Array,List to ArrayList. 

open maybe Array

stacks the signatures of stuff in Array before any current signatures of those definitions.



misc differences from OCaml:



apostrophe:


Chars are ''c instead of 'c'.


When preceded by whitespace and followed by "(" or whitespace, " ' " means "put the next element on the left" eg:

f1 x1 ' f2 x2 ' (f3 x3) ' f4

is equivalent to

f4 ( ( f3 x3 ) ( f2 ( f1 x1 ) x2 ) )


When " ' " follows "(" it acts like postfix function composition, eg:

map (' g ' f) array


(I wrote this before |> was added to OCaml.)



pattern matching:


function is removed; "... = function |"  becomes  "... = |". More generally, "|" starts a match when it follows (  "=" | "->" | "(" | " . "  )


match is removed; "match foo with | bar -> ..."  becomes  "foo ' | bar -> ..."  which defines an anonymous match function and applies it postfix


"|} ..."  is syntactic sugar for "| _ -> ..."


Like in SML, pattern matching should be able to take multiple arguments, eg:

| 0 y z -> y+z


fun is removed, replaced by multiple argument pattern matching:

(fun x y -> x+y)

becomes

(| x y -> x+y )


"| |" ends a nested match and starts a new element of the higher level match. "| | |" goes up 2 levels, and so on. Alternately, "{|" could be used to end the current match.



fallthrough:


Sometimes fallthrough is useful. This can be explicitly indicated in pattern matching with -> as follows:

input '

| Foo x ->

y = f(x) .

Bar y ->

| Bar y -> f2(y) .


The -> for fallthrough is used in place of a return value. Here, the "Bar y" output of the first match is matched against everything below it, and compiled to fallthrough or a goto to the first match.



switch is =match:


People always seem to want to use pattern matching as a case statement, but it's more about binding names to things. Let's give people what they want. So, "=" in a match means "evaluate the next item and match against its value."

| =foo -> ...

matches against the current value of "foo", instead of binding input to "foo" like 

| foo -> ...

would.


More generally, when part of a match begins with an infix operator, that part of the input is prepended to form an expression which is used as a condition for taking that match.



(restricted) goto:


"Ideally, a compiler should be able to describe its optimizations in the source language." – Hoare, quoted in Knuth, "Structured Programming with go to Statements"


mark [label] in   marks a position.  goto [label]  jumps there. A mark has the same lexical scoping as a "let" statement, and goto can't break out of a function definition. That means a goto can never leave a variable unassigned or cause a type mismatch.


When compiling, note that when a goto jumps somewhere that soon jumps somewhere else, eg an "if" statement, it can be worth inlining that target before the goto. This is a generalization of function inlining and loop unrolling; inlining is often done at the function level, but I think that's too coarse. Inlining could also potentially be improved by using some heuristics from GHC for deciding when to inline.



memory allocation:


alloc x : reference_type .

This allocates a reference_type without initializing it, and assigns it to x.


alloc matrix : int[foo][5] .

This creates an array of 5 arrays of [foo] ints.


Because the annotation starts in the type namespace while the bracketed sizes are ordinary expressions, you can do:

map f (x:'a array) =

alloc new : 'a[length x] . ...



vars are "autorefs" :


A var is a ref that is automatically dereferenced when: assigned to a variable with <<, used as a return value, or given to a function that does not require its undereferenced value. Vars are defined like refs are:

x = var 5 .


These vars replace mutability of arrays and mutable record fields. Array elements are automatically vars.


How do vars work with "maybe" functions? In addition to the whitelist built backwards from function application, functions also have a global signature that matches all of their possibilities. For example, arithmetic operations would have the type  'a -> 'a -> 'a  . (Input matching this could be used as a first check before going to the whitelist.) If the type signature component has as many "ref"s as the var's type, then the var is not dereferenced.

 

Because vars are so important, there might be a special operator for creating them, eg:

x =: 7 .

could be syntactic sugar for

x = var (7) .



underscore parameters:


Functions are often repeatedly used with the same argument names. So, let's have "_" as a function argument be equivalent to copy+pasting that part of its argument names from the function definition. So, with  "f (a,b,c) x = ... ."  doing  "f _ foo"  is equivalent to  "f (a,b,c) foo".



"[printf]" :


MLton doesn't have a printf. So instead, let's evaluate [stuff in brackets] in strings unless escaped as \[ and \]. To be more precise,

"foo [f x] bar"

is equivalent to

( String.concat "" ["foo "; to_string (f x); " bar"] )


If nothing besides whitespace is between brackets then nothing is inserted.
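A minimal Python sketch of this interpolation rule (illustrative only; PoML would rewrite the string at compile time, and the bracketed expression would be PoML, not Python, so `f(x)` stands in for `f x` here):

```python
# Bracket interpolation: [expr] is evaluated and converted to a string,
# \[ and \] are literal brackets, and whitespace-only brackets insert
# nothing. "env" plays the role of the enclosing scope.

def interpolate(template, env):
    out, i = [], 0
    while i < len(template):
        c = template[i]
        if c == "\\" and i + 1 < len(template) and template[i + 1] in "[]":
            out.append(template[i + 1])           # escaped literal bracket
            i += 2
        elif c == "[":
            j = template.index("]", i)
            expr = template[i + 1:j]
            if expr.strip():                      # whitespace-only -> nothing
                out.append(str(eval(expr, env)))  # the "to_string (f x)" step
            i = j + 1
        else:
            out.append(c)
            i += 1
    return "".join(out)

env = {"f": lambda x: x * 2, "x": 21}
print(interpolate("foo [f(x)] bar", env))  # -> foo 42 bar
```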


While usually handy, this system can cause problems, eg with regexes. So, there should also be a different way of writing strings that doesn't do that. This can be  :"text":  and should allow  "  and maybe nested pairs inside.



regex pattern matching, parsing:


Pattern matching should be extended to handle concatenation of strings and maybe other arrays. For example, with

f = match

| "birds " & foo & " feather" -> "one [foo] kind"

|} "other"

.

applying

f "birds of a feather"

gives  "one of a kind".
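The behavior of that concatenation pattern can be emulated with a regular expression; this Python sketch is only an analogue (the pattern compilation in PoML would happen statically, and `f` here is the hypothetical function from the example):

```python
# `| "birds " & foo & " feather" -> ...` binds foo to whatever sits
# between the two literal pieces; `|}` is the catch-all default.
import re

def f(s):
    m = re.fullmatch(r"birds (.*) feather", s)
    if m:
        foo = m.group(1)
        return "one %s kind" % foo   # "one [foo] kind"
    return "other"                   # the |} case

print(f("birds of a feather"))  # -> one of a kind
print(f("dogs"))                # -> other
```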


Lists or arrays could be used to indicate alternatives. For example, with

pets = match

| ["dog"; "cat"; "bird"] as pet & ["s"; ""]

-> "yes, a [pet]"

|} "no"

.

applying

pets "birds"

gives "yes, a bird"


"match" (or maybe "parse") is now a keyword that indicates a parse definition.


For more complex parsing, a packrat PEG style parser could be integrated into the language. Here is an example of how that could be done.


"*" and "+" indicate possible repetition, like in regular expressions. When a variable is used in multiple places and/or inside something repeating, the matches are put into a list.


Matching must be exhaustive and always return the same type. Parsing goes recursively inwards for as long as the types match. Parsing returns a recursive variant type structure that can be deconstructed with pattern matching.


With the same variable potentially appearing in multiple parts of a pattern, sometimes you want to do different things depending on what part of a pattern was matched. So, nested pattern matching should be allowed. For example, creating a tree structure based on ( )s:


type tree = Parens of tree list | S of string


f = match

|

( match

| "(" & x & ")" -> x

| x -> S x

)+ as y -> Parens y

| x -> S x

.

f "(one)two(three(four)(five))six"

gives

Parens [Parens [S "one"]; S "two"; Parens [S "three"; Parens [S "four"]; Parens [S "five"]]; S "six"]


That function could also be written as:


f =

f2 = match

| "(" & x & ")" -> x

| x -> S x

.

match

| f2+ as y -> Parens y

| x -> S x

.
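The paren-tree example can be sketched in Python to check the intended output shape; this is an illustrative analogue only, using Python lists for `Parens` and plain strings for `S`:

```python
# Recursive parse of balanced parentheses into a nested structure:
# Parens -> Python list, S -> Python string.

def parse(s):
    def go(i):
        items, buf = [], []
        def flush():
            if buf:
                items.append("".join(buf))
                buf.clear()
        while i < len(s):
            c = s[i]
            if c == "(":
                flush()
                sub, i = go(i + 1)   # recurse into the group
                items.append(sub)
            elif c == ")":
                flush()
                return items, i + 1
            else:
                buf.append(c)
                i += 1
        flush()
        return items, i
    tree, _ = go(0)
    return tree

print(parse("(one)two(three(four)(five))six"))
# -> [['one'], 'two', ['three', ['four'], ['five']], 'six']
```

This matches the structure of the expected result: Parens [Parens [S "one"]; S "two"; Parens [S "three"; Parens [S "four"]; Parens [S "five"]]; S "six"].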


This system may involve matches that are not exhaustive, so compiler errors should be thrown when a match that's not exhaustive is applied instead of when it's defined.



records:


MLton supports records with shared field names, so PoML records can share field names. When a record is defined, the last defined record type containing that exact set of field names is used. Combined with the "backwards" typechecking scheme, record operations can have more flexibility:


Functions of the form "f {x; y} =" or "f r = ... r.x ..." should be adapted to any record types they're applied to that contain (with matching types) all the field names used.


MLton supports anonymous records, so they should be allowed. They could replace labeled arguments, but that would make porting from OCaml harder.


with can be used on records, and moves out of the { }, becoming an infix thing that combines records:

new_record = record1 with record2 with {field=x; new_field=y} .
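Assuming later records override earlier fields (as in OCaml's "{ r with ... }"), the infix `with` behaves like a left-to-right field merge, sketched here with Python dicts for illustration:

```python
# `record1 with record2 with {field=x; new_field=y}` as a dict merge:
# later sources win on shared field names.
record1 = {"field": 1, "kept": "a"}
record2 = {"field": 2}
new_record = {**record1, **record2, "field": 3, "new_field": 4}
print(new_record)  # "field" comes from the last source, "kept" survives
```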


Record types can include the fields of other records:

type foo = { ... }

type bar = foo with {new_field:type; ...}

Object inheritance can be converted to this. Functions of the form "f (x : a_record_type) =" should be adapted to record types they're used with that include a_record_type.



modules? :


If we have anonymous records and add opening records like modules, records can replace modules and functors, eg:

functr {arg; arg; arg} =

f1 = [stuff] .

f2 = [stuff] .

{f1; f2}

.

output_module = functr input_record .

This is more elegant and powerful than the OCaml module/functor system.


"struct...end" can then be considered another way to define records, creating a record from every definition in scope at the end. With underscore arguments, we can write something like:

functr (arg, arg, arg) = struct

f1 = ... .

f2 = ... .

end .

out =

open input_record .

output_record = functr _ .

open output_record .

This would make it easy for functors to use new overloaded definitions for basic operators like "+" .


 

mixed lists:


Things like parser functions can require complex nested list literals as input. In OCaml, each element must be labeled with a constructor even if that information is redundant with its type, which can be cumbersome. So: when a list or array literal has a type annotation for a variant type, constructors are automatically added to each element that's not already of that variant type. The first legal constructor in the type definition is applied, recursing into nested list or array literals before adding constructors to them.


example:

type tree = List of tree list | String of string .

x = ["look"; ["ma"; "no"]; "constructors"] : tree .


In OCaml you would need to write:

let x = List [String "look"; List [String "ma"; String "no"]; String "constructors"]
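The insertion rule can be sketched in Python (illustrative; constructors are represented as tagged tuples, and the recursion mirrors "go inside nested literals first"):

```python
# Automatic constructor insertion for
#   type tree = List of tree list | String of string
# Nested lists are recursed into before being tagged List;
# strings get the first legal constructor, String.

def tag(x):
    if isinstance(x, list):
        return ("List", [tag(e) for e in x])  # recurse, then tag
    return ("String", x)

print(tag(["look", ["ma", "no"], "constructors"]))
```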



forwhile loops:


"for" and "while" loops are combined like ALGOL 68 loops. An example of the full syntax is

for i = 1 to 100 by 2 while x < y do ... done

and everything except "do ... done" is optional.


"do ... done" is like "while true do ... done" but it has a return value, which cannot be (). If the last executed statement of a loop cycle does not return () then the loop ends and returns that value.


Looping over an array is common. "map" and "iter" can be used for that, but don't work as well when using several arrays, and some people find them more confusing than for loops. So,

for i of array do ... done

should be equivalent to

for i = 0 to (size array - 1) do ... done



compiler:


I like the following compilation scheme:

PoML ->

new (PEG parser? Marpa parser?) frontend -> AST ->

MLton "Elaborate" replacement -> CoreML (MLton IL) ->

current MLton modules -> Machine ->

new backend -> LLVM IR ->

LLVM compilation -> executable


MLton is written in SML. Few people use SML, and it doesn't really have parsing tools, so using it for the new compiler components is probably impractical. Passing compiler data structures between languages is also problematic, and so is writing a compiler to an obscure target language. A feasible bootstrap process might be:


SML is similar enough to PoML for porting to be relatively easy. MLton is >100k lines of code, but if you remove the frontend, backend, libraries, and lexer, it's considerably smaller.



compiler usage:


The compiler concatenates (text piped to it, split by newline) and (its arguments, split by " "), taking that as its input. This allows compilation using utilities like find.


Module names are derived from file paths, eg:

"/dir/dog 1.0.2.pml" -> Dog

"/dir/src/cat stuff/array v4.pml" -> Cat.Array
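The rule these examples suggest (an inference on my part, not a stated specification) is: take the path components after "src", or just the filename if there is no "src" component; in each, drop the extension and anything after the first space (version suffixes); then capitalize and join with ".". A Python sketch:

```python
# Hypothetical module-name derivation, reverse-engineered from the
# two examples above.
import os

def module_name(path):
    parts = path.strip("/").split("/")
    if "src" in parts:
        parts = parts[parts.index("src") + 1:]  # components under src/
    else:
        parts = parts[-1:]                      # just the filename
    names = []
    for p in parts:
        p = os.path.splitext(p)[0]              # drop .pml
        p = p.split(" ")[0]                     # drop " 1.0.2" etc.
        names.append(p.capitalize())
    return ".".join(names)

print(module_name("/dir/dog 1.0.2.pml"))            # -> Dog
print(module_name("/dir/src/cat stuff/array v4.pml"))  # -> Cat.Array
```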


If multiple files go in the same module, then they're concatenated in the order they were input. This could be controlled in searches by alphabetical ordering, eg:

"/src/foo p1/stuff" and "/src/foo p2/stuff"


If a module name would be "main" then it's put in the parent module and goes before any submodule files, eg:

"/src/dog/main/main.pml" -> Dog

"/src/main.pml" is the main source file if compiling /src/


Compilation can be "precompilation" or ("linking" and/or "full"). Calling the compiler with "-pre" generates LLVM IR file(s) and .pmli interface file(s) that can be fed to the compiler later, one of each for every .pml file (but not .pmli file) being compiled. Precompilation is for libraries and large projects, to get faster compilation during development. Using "-link" generates a single LLVM IR file. Using "-full" generates an executable.


When the compiler is given a directory it searches for *.pml and *.pmli in that directory and adds the results to its input. Files beginning with "#" are ignored. When precompiling items, pmli files are put with their parent pml files.


By default, when there's a pml file and a pmli file with the same name in the same folder, the less recently modified one is skipped. There could be compiler options to always skip .pml files or .pmli files.


This system means a library can be installed with a single curl + unzip command. (However, LLVM IR is not platform independent, so multiple versions may be needed.)


.pmli files contain external function declarations of (top level function definitions with known types) in the source file.


A function has a "known type" if: its global signature is accepted by a typecheck, OR it has a declared type signature, eg:

f : maybe int -> int maybe float -> float = f .


The generated external function declarations are of the form:

[ocaml function name] =

external "LLVM IR function name": string -> int -> unit

.

 

[ocaml function name] =

maybe external "1st LLVM IR function name": @int array -> int

maybe external "2nd LLVM IR function name": @float array -> float

...


The "@" in the type signatures is a promise to the compiler that a mutable input is not modified by the function.


Using a .pmli file causes code in a corresponding LLVM file to be included. A LLVM file is chosen by searching the directory of the .pmli file for a filename matching (part of the .pmli filename, with the LLVM file extension appended). If .pmli files share an external function name, the latter function is renamed for the current compilation.


With "-link" the types of polymorphic functions to compile are determined by code calling them. With "-pre", function usage and type signature declarations are used to determine which variants of polymorphic functions are compiled and which functions are exposed for calling from outside the current compilation.



lib.cfg:


If a directory is being compiled, then it's searched for a file named "lib.cfg". The contents of lib.cfg look like:

Array

Library.Module

-compiler_flag

Foo.Bar

Everything in the library, and in the "src" folder of the directory being compiled, that matches one of those module names is added if it's not already being compiled. Multiple matches are a compiler error. Lines starting with "-" are compiler flags.


The importance of easy installation to popularity shouldn't be underestimated. So, how about the following system: When a listed module is not found, the compiler takes a URL from a text file somewhere, appends the module name, downloads that into its library, checks its lib.cfg file for anything not present, downloads more if necessary, and then compiles.



inline C code:


If compilation with .pmli files is possible, then inline code in other languages that can compile to LLVM code (probably C or C++) would be possible too. This code could be stripped out and compiled separately, then combined like LLVM code with a .pmli file. Inline C code should look like:


inline.c triple x : int -> int = arbitrary_delimiter

int triple (int x)

{

int r = x*3;

return r;

}

arbitrary_delimiter .

 

inline would then need to be a reserved keyword. The other languages used for inline code would need to have ABI compatibility with PoML or an ABI wrapper, so the PoML compiler should have ABI compatibility with Clang. If multiple functions are in the inline code, then they're all compiled and the definition refers to the one with the same name.

 

arbitrary_delimiter can be anything that could be a variable name, and marks where the inline code block ends.

 


message passing concurrency:


p = spawn [expression] .

This spawns a process evaluating an expression and assigns the process handle to p.


postmark lbl of [type] .

This declares a message label lbl and its message type, and assigns it to lbl.


Message.send p lbl bar;

This sends a message bar, labeled lbl, to the process with handle p. This could instead be a reserved keyword, deliver.


y = accept lbl x -> x | lbl2 x -> f x | _ -> f 0 .

This matches against messages not yet accepted. If a message labeled lbl has arrived since the last time the current process accepted an lbl message, then the value of the most recent lbl message is assigned to y. The _ case does not accept a message; it just provides an action for when none of the labels have pending messages. If there are no matches (which means no _ case is given), the current process waits until a matching message is received, yielding control and checking message flags when it resumes.  accept lbl  is allowed as shorthand for  accept lbl x -> x .


y = accept p | p2 x -> f x .

When accept is given a process handle of a child, it accepts the return value of that spawned expression, which is automatically sent to its parent. This allows functions to be spawned without modification, eg:

p = spawn f x .

result = accept p .
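The described per-label mailbox (only the last message per label is kept, with a "new since last accept" flag) can be sketched with Python threading primitives; this is an illustrative analogue, not the proposed runtime:

```python
# Each label keeps only its *last* message value plus a freshness flag;
# accept blocks until one of its labels is fresh, then clears the flag.
import threading

class Mailbox:
    def __init__(self):
        self.lock = threading.Condition()
        self.last = {}        # label -> last message value
        self.fresh = set()    # labels with a message since last accept

    def send(self, label, value):
        with self.lock:
            self.last[label] = value   # overwrite: only the last survives
            self.fresh.add(label)
            self.lock.notify_all()

    def check(self, label):            # the Message.check primitive
        with self.lock:
            return label in self.fresh

    def accept(self, *labels):
        with self.lock:
            while not any(l in self.fresh for l in labels):
                self.lock.wait()       # yield until a match arrives
            for l in labels:           # labels tried in order written
                if l in self.fresh:
                    self.fresh.discard(l)
                    return l, self.last[l]

box = Mailbox()
box.send("lbl", 1)
box.send("lbl", 2)          # overwrites the first message
print(box.accept("lbl"))    # -> ('lbl', 2)
print(box.check("lbl"))     # -> False after accepting
```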


Message.check lbl

This is true if a message labeled lbl has been delivered to the current process since its last accept of lbl, otherwise false.


Message.parent

The handle of the process that spawned the current process.


Message.self

The handle of the current process.



Parallel GC is probably the biggest issue with compiling ML for multiple cores. With PoML, you'll know whether a spawned function receives a ref and whether it modifies it. If so, put all processes that might use the object that ref points to after that point into the same GC group; a GC group collects at the same time, the way GHC's collector does. Spawned pure functions can use independent/concurrent GC and copy messages.

 

 



