Data Layout

ullage

For the initial version of the language we just need Bool, Number, and String to have a defined layout. It is probably worth thinking about the future structure of arrays, tuples, and structs as well, though.

Value and Reference Semantics

The plan is that Bool, Number, and String will all have value semantics. That is, a modification of a String value in one place will not affect its appearance in another. This is similar to Copy types in Rust and struct types in C#. I feel that tuple types should also have value semantics, similar to ValueTuple in C#.

Array and structure types will instead have reference semantics. This means that passing a structure to a function allows the function to modify the caller's structure value. This is similar to reference types in C# and &mut references in Rust.
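The difference between the two kinds of semantics can be sketched in Rust, which the text above already uses as a point of comparison. The names Point, shout, move_right, and demo are made up for illustration, and the explicit clone stands in for the copy that value semantics would imply in the language itself.

```rust
/// Hypothetical structure type standing in for a user-defined struct.
#[derive(Debug, PartialEq)]
pub struct Point {
    pub x: i64,
    pub y: i64,
}

/// Value semantics: the callee owns its own copy of the string, so the
/// caller's value is unaffected by the modification.
pub fn shout(mut s: String) -> String {
    s.push('!');
    s
}

/// Reference semantics: the callee mutates the caller's structure in place.
pub fn move_right(p: &mut Point) {
    p.x += 1;
}

/// Runs both cases and returns the caller-side values afterwards.
pub fn demo() -> (String, String, Point) {
    let original = String::from("hello");
    // The clone models the copy implied by value semantics.
    let shouted = shout(original.clone());

    let mut p = Point { x: 0, y: 0 };
    move_right(&mut p);

    (original, shouted, p)
}
```

After demo runs, the original string is unchanged while the point has been modified through the reference.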

Type Layouts

For the primitive types we have the following type layouts from language type to LLVM type:

Bool -> i1
Number -> i64
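A compiler pass could express this mapping as a small lookup. The following is only a sketch: the Primitive enum and its method names are assumptions, and String is left out because its layout is discussed separately.

```rust
/// Primitive types of the source language (names taken from the text above).
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Primitive {
    Bool,
    Number,
}

impl Primitive {
    /// Textual name of the LLVM type used to represent this primitive.
    pub fn llvm_type(self) -> &'static str {
        match self {
            Primitive::Bool => "i1",
            Primitive::Number => "i64",
        }
    }

    /// Width in bits of the LLVM integer type.
    pub fn bit_width(self) -> u32 {
        match self {
            Primitive::Bool => 1,
            Primitive::Number => 64,
        }
    }
}
```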

Strings are represented as a pointer to a pair of length and data:

String -> <{i32, [0 x i8]}>*

LLVM integer types are signless, so the length is an i32 that is treated as unsigned and the data is an array of i8 bytes. The bytes of the string are stored inline as part of the pair: allocating a string uses a variable-length array to hold its UTF-8 encoded bytes. There are a few problems with this:

Every copy of a string value has to duplicate the whole buffer. The expectation is that strings are rarely modified, so we could probably share a single buffer between string instances and use reference counting to control mutable access.
Freeing the correct amount of memory requires some knowledge of when a value is 'dropped'.
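As a rough illustration of the inline layout described above, the sketch below packs a string into one buffer holding a 32-bit length followed by the UTF-8 bytes. The function names and the little-endian encoding of the length are assumptions made for the example; the real layout would be emitted as LLVM types rather than built by hand.

```rust
/// Size in bytes of the 32-bit length prefix.
pub const LEN_PREFIX: usize = 4;

/// Packs `s` as a 32-bit length followed by its UTF-8 bytes.
/// Returns `None` if the string is too long for a 32-bit length.
pub fn pack_string(s: &str) -> Option<Vec<u8>> {
    if s.len() > u32::MAX as usize {
        return None;
    }
    let len = s.len() as u32;
    let mut buf = Vec::with_capacity(LEN_PREFIX + s.len());
    buf.extend_from_slice(&len.to_le_bytes());
    buf.extend_from_slice(s.as_bytes());
    Some(buf)
}

/// Reads the string back out of a packed buffer.
/// Returns `None` if the buffer is truncated or not valid UTF-8.
pub fn unpack_string(buf: &[u8]) -> Option<&str> {
    let mut prefix = [0u8; LEN_PREFIX];
    prefix.copy_from_slice(buf.get(..LEN_PREFIX)?);
    let len = u32::from_le_bytes(prefix) as usize;
    let end = LEN_PREFIX.checked_add(len)?;
    std::str::from_utf8(buf.get(LEN_PREFIX..end)?).ok()
}
```

Note that the size of the allocation can only be recovered by reading the length prefix, which is one reason the point at which a value is dropped matters.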

Given these concerns we could lay a string out as:

String -> <{i32, i32, [0 x i8]}>*

In this representation each string value is a pointer to a reference-counted backing buffer, with the additional integer field holding the reference count. This should reduce the copy size of each string to that of a pointer, and means that a string reference again has a single, easily known size. We still need to know when a reference is dropped, however, so that the buffer can be deallocated.
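The sharing behaviour can be approximated in Rust with std::rc::Rc, used here as a stand-in for the reference-counted buffer rather than as the proposed layout. The Str wrapper and its methods are hypothetical. Rc::make_mut copies the buffer only when it is shared, which is one way reference counting could control mutable access.

```rust
use std::rc::Rc;

/// Hypothetical string handle: a single pointer to a shared, counted buffer.
/// (Rc<String> adds an extra indirection that the inline layout would avoid.)
#[derive(Clone, Debug)]
pub struct Str(Rc<String>);

impl Str {
    /// Allocates a new buffer with a reference count of one.
    pub fn new(s: &str) -> Self {
        Str(Rc::new(s.to_owned()))
    }

    /// Number of handles currently sharing the buffer.
    pub fn ref_count(&self) -> usize {
        Rc::strong_count(&self.0)
    }

    /// Appends to the string, copying the buffer first if it is shared.
    pub fn push_str(&mut self, extra: &str) {
        Rc::make_mut(&mut self.0).push_str(extra);
    }

    /// Borrows the string contents.
    pub fn as_str(&self) -> &str {
        self.0.as_str()
    }
}
```

Cloning a Str only bumps the count, and a later modification through one handle leaves the other handle's contents untouched, which preserves value semantics from the programmer's point of view.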

Garbage Collection

Rather than aiming to control access to data as Rust does, the language should provide a garbage collection mechanism to clean up data once nothing references it. There are a few alternatives for this:

Don't deallocate. Probably useful to get us off the ground.
Reference counting, e.g. Swift's ARC and Python.
Simple mark and sweep GC.

For a full mark and sweep or other tracing collector, the generated code needs to include GC statepoints. For this reason I'm tempted to head towards the second option, reference counting. In place of statepoints we will need to decide in the lower pass where to insert RC retain and release code to maintain the count. Could this work the same way as Rust's Rc and Arc types, where the count is decremented when a value is dropped at the end of its scope? In that case the code just needs to know a fixed point in the scope where each value is deallocated.
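To make the retain and release idea concrete, here is a small simulation in Rust. The Heap type, the Ref handle, and the demo function are all hypothetical; a vector of slots stands in for real memory, and the calls in demo mark where a lowering pass might emit retain and release under a scope-based scheme.

```rust
/// Hypothetical runtime heap: each live slot holds a count and a payload.
#[derive(Default)]
pub struct Heap {
    slots: Vec<Option<(u32, String)>>,
}

/// Index of a slot in the heap, standing in for a pointer.
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct Ref(usize);

impl Heap {
    /// Allocates a value with a count of one (the creating scope's reference).
    pub fn alloc(&mut self, value: &str) -> Ref {
        self.slots.push(Some((1, value.to_owned())));
        Ref(self.slots.len() - 1)
    }

    /// Retain: emitted wherever a new reference to the value is created.
    pub fn retain(&mut self, r: Ref) {
        if let Some((count, _)) = self.slots[r.0].as_mut() {
            *count += 1;
        }
    }

    /// Release: emitted at the fixed point where a reference leaves scope.
    /// Frees the slot once the count reaches zero.
    pub fn release(&mut self, r: Ref) {
        let free = match self.slots[r.0].as_mut() {
            Some((count, _)) => {
                *count -= 1;
                *count == 0
            }
            None => false,
        };
        if free {
            self.slots[r.0] = None;
        }
    }

    /// Reports whether the slot still holds a value.
    pub fn is_live(&self, r: Ref) -> bool {
        self.slots[r.0].is_some()
    }
}

/// Simulates an outer scope that shares a value with an inner scope.
pub fn demo() -> (bool, bool) {
    let mut heap = Heap::default();
    let s = heap.alloc("hello"); // outer scope creates the value
    heap.retain(s); // inner scope takes a reference
    heap.release(s); // inner scope ends
    let live_after_inner = heap.is_live(s);
    heap.release(s); // outer scope ends, value is freed
    (live_after_inner, heap.is_live(s))
}
```

The value survives the end of the inner scope and is freed only when the last reference is released at the end of the outer scope.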