iota

NOTE: this is an old draft, the current design document is at https://dmytrish.net/lib/iota/

A low-level language aiming to be small and principled:

  • Reifies abstractions all the way from full linear/ordered dependent typing down to bare-metal byte manipulation
  • Nothing is forced to be unsafe at any level of compilation or execution, as long as you can and want to prove safety
  • A limited, ascetic and opinionated set of language features
  • Unambiguous, simple, minimalistic syntax; context-free grammar
  • Having nested dot-namespaces and context-aware completion is a must.
  • Make everything (threads, stacks, allocators, effects, assumptions) as explicit as possible, while keeping it not too verbose.
  • Every bit of data is transparent, structured and accounted for. Its meaning can be erased only explicitly.
  • Everything that happens at runtime can be reflected on. A systems language is more than just a compiler blueprint.

Inspired by:

  • Golang: syntax, syntactic minimalism
  • Rust: safe memory management, traits, namespaces
  • Zig: compile-time evaluation, explicit allocators
  • Swift: witness tables, layout-agnostic structs
  • Austral: linear types, capabilities, runtime minimalism
  • Lean: proofs, properties, standard library, syntax?

Syntax

Go-style, minimalistic (no semicolons, fewer parentheses, fewer special characters). The AST maps onto const data structures that can be manipulated by const functions to enable metaprogramming.

# a one-line comment.

# language is built around databags, which are like Lua tables.
bag_example = (
	key1 = value
	/*0 = */ val
	nested_bag = (one = 1, two = 2)  # if on one line, use commas.
)
# <name>.iota files are databags themselves. Directories are databags.

# functions
sqr [T: has_mul] (x: T): T = { x * x }
s1 = sqr 12            # eager: comptime-evaluated to 144
s2 = { sqr 12 }        # curly braces make it lazy

less_than (y: int) (x: int) : bool = { x < y }
less_than_42 = less_than 42    # func(x:int): bool { x < 42 }

But: Go syntax is not really minimalistic when it comes to lambdas and complex, Lean-level details. Application by juxtaposition is good; parentheses are not necessary. Can Haskell/Lean syntax be made less alien?

Is the t: T syntax for the typing relation too entrenched to express it any other way?

Semantics

Dimensions of the language:

  • before-time / const-time / run-time
  • capabilities: const-time (e.g. async) / run-time capabilities
  • the data language. Data notation. Code is data. Run-time data. Encoding vs schema.
  • value aspects: types, traits, lifetimes and linearity, mutability (?).
  • source code: unstructured text, semistructure (i.e. "JSON"), structure (i.e. AST + modules/packages), richer-than-Lisp homoiconicity based on databags. Progressive semantic enrichment: text $\rightarrow$ structure $\rightarrow$ AST (+tags) $\rightarrow$ const-modules. Attribute grammars / ECS-style AST.
  • run-time model: packages, dynamic capabilities, memory management (allocations). Error handling.
  • practicalities: complexity budget, familiarity, syntax, tooling, testing.
  • scaling up and down: easy to start.

Evaluation

The language is stratified into:

  • the before-time, everything fixed about the language: AST entities, the full list of capabilities, keywords, kinds, prototypes of builtin types, etc. Schemas for types: U8, UPtr, Array, Dict, AST nodes, EnumCapabilities, Fn0, Fn1, ... ?
  • the compile-time, types, AST/metaprogramming, a subset of "lifted" run-time functions and executable semantics. Executable layout/segments/static linking. Dynamic type tables. Simple non-programmable C-style "metatypes", sufficient to program generics.
  • the run-time: values, functions, struct values, array values, heap/stack, threads/async, allocators.

The run time can be compiled for any host environment; the compile time should run on WASM/RISC-V emulation (like cksystemsteaching/selfie).

A run time unit is a package. A package is a compiled library that can also have an entry point to run as an executable. Packages are units of translation. Packages are linked (dynamically too?) to run processes. Packages get dynamic capabilities from packages that import them. Steal from Swift: type witness tables and layout-independent interop.

A package consists of static logical namespaces called modules. Modules receive their capabilities from their package into their contexts and are statically verified. A module that gets the proc.memory capability should be considered unsafe (like unsafe code in Rust): it can violate memory safety and therefore capability safety. Modules are like Haskell monads or an effect system: they group functions with similar effects (?). A hypothetical sketch follows.
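A sketch of how a module could declare the capabilities it needs from its package (the require key, the capability names, and the module syntax here are assumptions of this draft, not settled design):

# this module may allocate and open TCP connections, but it gets
# no proc.memory: it stays memory-safe by construction
const net_client module(ctx: Context) (
	require = (proc.alloc, net.tcp)
) is
	...
end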

Interop

Goals:

  • make C ABI interop rich and powerful (so it works well)
  • make C headers interop easy, almost seamless technically (like Zig's @cImport)
  • make it complex and demanding, proof-heavy when imported (lots of additional static information)
  • Rust interop via extern "interop_abi"?
  • WIT from WASM?
  • Q: GC-based languages

Entities

-1. propositions

Propositions are like Prop in Coq: they describe some properties of other entities and their relations.

???

0. functions

Function arguments should be clear. The annotation clutter (types, resources/lifetimes, mutability, ref/use-borrow/own, traits, etc.) can be moved out someplace else.

"functions as abstractions" (typed values) vs "functions as executable code" (resources that consume other resources like stack and executable segment, use and provide capabilities, etc). A stack frame as basically a struct value (also: good for async and reified state machines). Threads of executtion are explicit.

Runtime relies on heavily inlined functions:

  • arithmetic is functions
  • struct field access is functions
  • array access is functions
  • ...

1. pure values

Some values are pure, "platonic": they carry meaning, type and properties, but do not have a fixed memory encoding (i.e. size, alignment, etc). E.g.:

Booleans: Bool = True | False. Boolean operations:

  • (b: Bool).not() -> Bool
  • (b1: Bool).or(b2: Bool) -> Bool
  • (b1: Bool).and(b2: Bool) -> Bool
  • ...

Builtin numbers:

  • Bit = 0 | 1
  • num.Int: ..., -1, 0, 1, ...: a pure number, without run-time representation guarantees; hidden in lang.num by default (examples after this list).
    • Int.zero(): Int.
    • (n: Int).succ(): Int
    • Int.one(): Int = Int.zero().succ()
    • (m: Int).add(n: Int): Int
    • (n: Int).twice(): Int
    • (n: Int).exp2(): Int
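For illustration, a few const-time evaluations these operations could support (assuming eager comptime evaluation, as in the Syntax section):

two     = Int.one().succ()     # 2
four    = two.twice()          # 4
sixteen = four.exp2()          # 2^4 = 16
twenty  = four.add(sixteen)    # 20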

Builtin witnesses:

  • (num: Type).rangeFrom(start: num): a refinement type.
  • (num: Type).rangeTo(end: num): to end non-inclusively.

Aliases:

  • num.Nat: Type = Int.rangeFrom(Int.zero)
  • num.Fin: (n: Nat) -> Type = Nat.rangeTo(n)
  • bin.Num: (n: Nat) -> Type = Fin(n.exp2())
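Hypothetical uses of these aliases:

Byte = bin.Num(8)    # = Fin(2^8): the pure numbers 0 .. 255
Idx  = Fin(5)        # e.g. the valid indexes into a 5-element array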

2. memory resources: cells and places

A cell: a piece of memory made of size: Nat octets of Bits (assumption: memory comes in octets of bits).

# Addr is an abstract identity of a Cell:
const Addr
func (addr: Addr).space: AddrSpace
func (a1: Addr).is(a2: Addr): Bool

const Cell
# TODO: it is a Resource on Memory
func (p: Cell).addr: Addr
func (p: Cell).size: Nat
func (p: Cell).parent: Resource        # parent Cell? 
func (p: Cell).manager: Maybe(Allocator)

Addr is a member of a set of type AddrSpace. It can be many things: a bare bin.Num as a C-style pointer, a pair of (AddrSpace, bin.Num), a 16-bit x86 pair of (Segment, U16), a CHERI memory handle (TODO), a hash for a content-addressable store (?), etc.

Process memory segments are Cells without values (but e.g. unused .bss can have a witness that all its bits are zero). They are narrowed down to smaller and smaller non-overlapping Cells used as Places for values of specific types. Cells are linear resources.

A place is a combination of a cell + a value type + a binary encoding of values into sequences of bits.

with (
	size: Int,
	ty: Type,
	encoder: ty -> Seq(size, U8),
	decoder: Seq(size, U8) -> ty,
) {
	const Place(ty)
	with (p: Place(ty)) {
		p.cell(): Cell
		p.get(): ty
		p.ref(): Ref(ty)
		do p.set(value: ty)     # linear action, consumes and produces
	}
}

Q: should the Place <-> Type relation be 1:1? Or: can a Cell be part of many Places?

3. arrays

Arrays are fixed-size resources, laid out in contiguous memory (possibly with stride != item.size?).

No access by bare Nat indexes. Indexes should be opaque and carry witnesses that they are valid, plus more static metainformation.

test {
	let a = do mem.Array(Int, 5)
		
	# with a statically known size, the compiler can verify this (implicit witness?):
	do a.at(1) = 43

	# otherwise, the out-of-bounds case must be handled explicitly:
	do a.at(5) = -1 or error("out of bounds")
}

func (a: Array) Reverse() {
	var i0 = a.first()    # witness: i0 < a.len
	var i1 = a.last()     # witness: i1 < a.len
	while i0 < i1 {       # witness: i0 < i1
		a.swap(i0, i1)    # since i0 < i1, these are different places
		do i0.next()      # since i0 < a.len, no need to handle OOB
		do i1.prev()      # since i0 < i1, no need to handle OOB
	}
}

TODO: Accommodate Structs of Arrays naturally.

4. structs

Structs are less prominent and more abstracted here than in many other low-level languages. The notion of a composite value (a typed record) and the memory resource holding it (a Cell) are explicitly separated.

Structs do not necessarily live in contiguous blobs of memory (e.g. SoA). Abstract references do not necessarily correspond to a pointer. Access happens via accessors that are efficiently inlined, as sketched below.
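A hypothetical sketch, assuming a record Point = (x: F64, y: F64); the ArrayAoS/ArraySoA names and the index i are illustrative:

# the same record type, two different layouts:
aos = do mem.ArrayAoS(Point, 100)    # contiguous (x, y) pairs
soa = do mem.ArraySoA(Point, 100)    # two parallel columns of F64

# both expose the same accessors, inlined to layout-specific address math:
a = aos.at(i).x
b = soa.at(i).x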

5. traits

Can add complexity, but very useful to keep dot-namespaces flexible and extensible. Not sure how to fit them into the language yet.

Types

Either a) full dependent typing with universes (comptime "stages", to sound less bombastic), or b) limited or no type expressivity at compile time itself (?). const-eval should have a fixed set of simple C-like types (ints; chars; structs; fixed arrays?). These are projected into run time and extended by arbitrary user-defined types.

Yes, structs in compile time + associated methods, because compile time must have capabilities.

Type system: linear types/generics/traits + metaprogrammable, like in Zig.

1. values and functions

2. witnesses and properties

???

Resources

Resources are provided by the execution context. They are linear and should be recycled exactly once.

Capabilities

Dynamic capabilities: run-time objects that allow specific run-time actions. Structured linear collections of values + value-specific actions + limits on actions?

$$\mathit{Capability} = (\mathit{object}, \mathit{action}, \mathit{limits})$$

Capability objects usually allow narrowing down to a narrower capability (either consuming or uniq-ref'ing their parent capability?). Or a narrower capability can be split off the parent (e.g. a buffer from a buffer).
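For example (a hypothetical API; narrow() and the capability shapes are assumptions):

with (fs: os.fs) {
	# split a narrower capability off the parent:
	let data_fs = do fs.narrow("/var/data")
	# data_fs permits file actions under /var/data only;
	# hand it to code that should not see the rest of the filesystem
	do untrusted.run(data_fs)
}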

Static capabilities:

  • rewriting code in a scope (e.g. async) during compilation
  • ensuring that some dynamic capability cannot be used in a scope (e.g. no core.mem, no core.alloc, no core.panic, no os).

Such static guarantees only make sense for non-native, emulated/JIT execution. Already-compiled native code can be granted capabilities, but cannot be trusted not to use other ones.

  • proc.memory: object: the entire process memory; actions: peek, poke, transmute. This is roughly unsafe Rust. This capability is special: having random memory access capability means having the ability to get all other capabilities as well. Can objects be reduced to specific buffers?

  • proc.alloc: object: heaps and their memory; actions: malloc, realloc, free.

  • proc.panic: object: thread, action: exit/abort/atexit; not giving this to a library gives you #[no_panic] in Rust.

  • proc.mmap: object: process memory segments, actions: mmap/munmap, listing segments of the current process and libraries, managing JIT segments, etc.

  • proc.lib: object: shared libraries, actions: list, search, dlopen/dlsym/...

  • proc.thread: objects: pthreads; actions: spawn, kill, synchronize;

  • proc.async: green threads, async/await? This is an interesting case: importing this capability changes how compiler generates code ("const-capability"?).

  • os.io: objects: stdin/stdout/stderr, actions: read/write/close.

  • os.fs: a list of directories + actions on them; ability to open/read/write files.

  • os.proc: interacting with other processes: getpid, fork, exec, kill, pipes, popen, ptrace; IPC?

  • os.time: a dict of clocks (walltime, monotonic, high-precision, etc) and timers

  • os.signal ?

  • os.user: objects: UIDs/GIDs, actions: getuid/setuid/setgid/...

  • net.dns: object: DNS domains or wildcards, actions: resolving DNS records

  • net.ip: object: IP endpoints, actions: getting ip addr/ip link info?

  • net.tcp: object: a TCP endpoint (IP endpoint + TCP port), actions: connecting, send/recv

  • net.udp: same as net.tcp

  • gpu.??? : access to GPU contexts, draw calls, textures, shaders, etc.

  • cap.manage: managing/introspecting capabilities themselves?

Static capabilities.

  • async: allows rewriting code of a scope into state machines at compile time.
  • no panic: statically guarantees that no panics can come out of this scope.
  • no alloc: statically guarantees that no allocations can happen in this scope.
  • safe memory: statically verified absence of direct core.mem usage (see the sketch below).
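A sketch of how such a guarantee could be requested at the module level (the Context.without form is an assumption, not settled design):

# the module body is statically verified to never use proc.alloc:
const kernels module(ctx: Context.without(proc.alloc)) (
	...
) is
	...
end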

Open questions:

Safe code built on unsafe code; in other words, the interplay between capabilities and modules/packages. E.g. std.Rc must use the proc.memory capability.

  • but it's in std, can receive this authority via std.init() or something. Then it decides which modules get proc.memory ?
  • Modules get a capability statically: capability might be present in a loaded package instance, but statically verified to not be used in a specific module.
  • Getting a capability via memory introspection: possible using proc.memory.

Capabilities might be indirect:

  • writing to a file on NFS causes network requests;
  • ability to write /proc/self/mem means having core.mem (no language is truly memory safe on Linux!).
  • Opaque data can smuggle dynamic capabilities through a module that prohibits this capability (e.g. passing a reference to an allocator through a no-alloc module)?
  • Any piece of code with access to capabilities can implement a Turing-complete interpreter that makes static analysis of effects undecidable in general, and then provide its capabilities to third-party code indirectly. Is static analysis of capabilities feasible at all? How limited would a statically-verifiable version be?

Run time

Values:

// byte := uint8('h')
let byte = 'h' : U8
let x = 12 : I32
let pi = 3.14 : F32

Structs/records:

// shortcut (?): struct Point {
const Point {
   x: F64
   y: F64
}

// shortcut: `impl Point` (?)
Point = trait {
	radial = (&self) func() {
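		// `.x` presumably resolves against `&self`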
		return .x 
	}
}

Minimalistic bootstrapping using proc.memory capability.

Memory management

Run-time behavior depends on the given capabilities. Modules can be allocating or non-allocating. A package can request proc.alloc; then it is considered alloc-dependent. The sum of the capabilities of dependent packages must be computed to ensure that a package is no-alloc (or e.g. no-panic). proc.alloc can be a heap memory allocator or a garbage-collecting allocator. Each package (module?) instance can use its own allocator.

Linear types + defer for resources that must be freed, including heap allocations?
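A sketch under these assumptions (the alloc/free names are illustrative; the allocated place is a linear resource, so the compiler checks that it is recycled exactly once):

func withScratch(alloc: proc.alloc) = {
	# allocation is fallible and must be handled:
	let scratch = do alloc.alloc(U8, 4096) or error("out of memory")
	# defer guarantees the single recycling of the linear place:
	defer do alloc.free(scratch)
	...
}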

Low-level, static memory management (i.e. ~ Rust core) in const-eval?

Compile-time

Heavily unified around const functions and values, Zig-style, with possible shortcuts.

The basic building block is the const function, a compile-time function. Modules are const functions; imports are const calls of modules. Types are const values.

Compile-time evaluation: WASM interpreter? Constrain to terminating evaluations only?

const-data language

The const-data language should be (at the same time!):

  • a subset of the run-time language, to be familiar (?)
  • more dynamic and declarative than the run-time language (which would be nice). A Jsonnet-like data language?

Q: Are these goals compatible at all?

Parentheses delimit a "databag" structure: an ordered associative array (like a Lua table, possibly with implicit indexes).

(
	/*0 = */ 12,      // a literal, with implicit key 0
	key = value,      // explicit pair of const-literals "key", "value"
)

A function argument list is a bag at comptime. A record definition is a bag. So are braces and source code files? Zig: files are just const structs.

Modules

                          Iota  Go                  Rust                          Haskell
global default namespace  ?     "pkg.go.dev"        "crates.io"                   a global root
other namespaces          ?     "github.com/", etc  cargo configurable?           cabal configurable?
"project"                 ?     module, "go.mod"    a cargo workspace or project  a cabal unit
"library"                 ?     package             a target in a project         a sub-unit
"binary"                  ?     package main        a target in a project         a sub-unit
Translation unit          ?     =package            =crate                        =file

module namespaces

Go-style, one implicit global root like "pkg.iota.dev/", optionally other roots like "github.com/org/". These are root packages that include other packages recursively? The universe of packages is just a lazy bag, like in Nix?

The file tree of a package my-pkg:

my-sub-pkg/
  package.iota
  ...

assets/           # static resources
build/            # like target/ in Cargo
docs/
deps/
  github.com/org/package/import.iota
  github.com/org/pkg/subpkg/import.iota
examples/
exe/              # like cmd/ in Go, app/ in Cabal ?
  binary1.iota
lib/              # like pkg/ in Go
  pubmod1.iota
  pubmod1.test.iota 
  pubmod2.iota
src/              # like internal/ in Go
  internal1.iota 
test/             # like tests/ in Cargo?
package.iota

It is equivalent to a table:

(
	package = @import("package.iota")
	lib = (
		pubmod1 = @import("lib/pubmod1.iota")
		pubmod2 = @import("lib/pubmod2.iota")
	)
	exe = (
		binary1 = @import("exe/binary1.iota")
	)
	src = (internal1 = @import("src/internal1.iota"))
	deps = (
	    `github.com`/org/pkg = @import("deps/github.com/org/package/import.iota")
	)
	...
)

What's a package?

  • a const-value of:
    • metadata
      • name, description, url, license, etc
      • dependencies
    • list of its modules
  • a const-trait that can build()/test()/etc its values?
    • iotac executable == "iota const evaluator". iota-"cargo" is just a library.
    • $ iotac build --target=binary1 == run const-fn package.build(target = binary1)
    • $ iotac test == run const-fn package.test()
    • ...

package.iota is a go.mod-like project file, an entry point to building. A mix of cargo.toml, Deno's main file, go.mod, etc.

// my_package = package(
// package my_package(ctx: Context)   // shortcut

// like cargo.toml,
// key `name` set to `my_package`:
name = "my_package"
description = "foos bars"
url = "author/lib"     // if uploaded to the main registry

dependencies = (
  "github.com/org/lib" = (version = "v0.1"),
)

require = (
  // the list of capabilities required from the environment
)

// Q: how to describe build targets and tests?

// Q: how to include modules?
// like `mod mod1;` in Rust, Go-style syntactically?

// end

binary vs library

Entry points require a uniq ref to a process resource (borrowed from the OS).
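A hypothetical entry-point signature (os.Process and ExitCode are illustrative names):

// the loader lends the process resource; the uniq ref carries
// the root dynamic capabilities of the process
const main = func(proc: uniq os.Process): os.ExitCode {
	...
}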

module definition

// shortcut: module main(ctx: Context) (...) { ... }
const main module(ctx: Context) (
	doc = "module documentation goes here?"
	// public definitions, module "signature"
) is

// imports
const std = ctx.import("std")

// private definitions, implementations

end // closes is, optional if EOF

module imports

// full form
self.import(
  "std"                 // std = "pkg.iota.dev/std"
  "github.com/org/lib"  // lib = "github.com/org/lib"
  rename "name"
)

module contexts

The context of a module defines the list of accessible static capabilities.

An executable gets the root dynamic capability from the environment (e.g. from the ELF loader or another loader middleware).

A library gets its capabilities from its importer.
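A sketch of this capability flow (ctx.caps.pick is an assumed helper, not settled design):

// inside an importer: grant the imported library only what it needs
const http = ctx.import(
	"github.com/org/http",
	caps = ctx.caps.pick(net.tcp, net.dns),
)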

Standard library

logic.bool

The basic building block for data: the two-valued inductive type bool. Will become a machine bit.

logic.prop

The basic building block for proofs, types and reasoning about code.

logic.pair

The basic constructor for complex heterogeneous data. It's a building block for records.

logic.list

Type-wise: the basic constructor for homogeneous data. It's a building block for arrays, vectors, etc.

logic.lambda

Type-wise: structural expressions for the untyped lambda calculus, used later by a prover to manipulate terms. The basic building block for describing the semantics of (unrestricted, not linear) code.

Q: should this be dependently-typed right away?

logic.linear

Augments logic.lambda with linear typing. It's a basic building block for describing resources, e.g. heap allocations and memory segments.

Describes the algebra of resources: allocation, single use, deallocation.

logic.ordered

Augments logic.lambda and logic.linear with an ordered type system (see Advanced Topics in Types and Programming Languages by Pierce). A building block for describing stacks and stacked contexts.

Describes the algebra of stacked resources, e.g. lifetimes.

logic.nat, logic.num, logic.fin, logic.field

Basic Peano numbers to be used in proofs. Their properties.

Finite enumerations and sets. Especially of size $2^n$.

(Finite?) fields and their properties. Especially on $2^n$-sized sets (Zig-like u2, u3, ..., u8, u9, ...).

logic.array

vec n T in dependent types. A fixed-size sequence of items of some type, indexed by fin. A special case: $2^n$-sized arrays (useful because they can be totally indexed by a bit sequence).

data.byte, data.word

Describe fixed-size sequences of bools: 8 bools for byte, 32/64 bools for word. Define the usual CPU operations on them (shl, shr, not, and, ...).

Defines $\mathtt{u8} = \mathtt{finfield}(+, 2^8)$ and a mapping $\mathtt{byte} \iff \mathtt{u8}$. The same for u64: $\mathtt{word} \iff \mathtt{finfield}(+, 2^{64})$.

Describes the two's-complement mappings $[-128 .. 128) \iff \mathtt{i8}$ and $[-2^{63} .. 2^{63}) \iff \mathtt{i64}$. Operations on signed numbers.
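E.g. under the two's-complement mapping, the i8 bit pattern $\mathtt{0xFF} = 255$ denotes $255 - 2^8 = -1$.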

data.utf8

Describes a sequence of bytes which is a correct UTF8-encoded Unicode-ish string. A basic building block for symbols?

data.buffer

Describes a linearly typed array of bytes which can be read and written into (for byte, word, maybe utf8). Algebra of linearly typed arrays (ownership, splitting, swaps, etc). Special case: $2^n$-sized buffers.

data.bag

(Q: what are symbols? Are they just utf8 byte sequences? Are they (opaque) words, e.g. pointers/hashes? Any additional properties on them?)

Describes a JSON-structured data type, bag.

<bag> ::= unit                       /* json: null */
        | bool                       /* json: bool */
        | <nat>                      /* json: num */
        | <utf8>                     /* json: str */
        | ([<sym> =] <bag>, ...)     /* json: object/array */

A bag is a known-size associative array with a known set of symbols/nats as keys. The order of key-value pairs is preserved. Keys can be implicit, filled in by incrementing nats, like in Lua.

Bags are values, their bit representation is not defined. Other things can be coerced to bags (records, any inductive data type structurally?).
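Hypothetical bag literals under this grammar:

(1, 2, 3)                  # implicit nat keys: (0 = 1, 1 = 2, 2 = 3)
(name = "iota", 42)        # explicit and implicit keys mixed
(nested = (a = 1), "x")    # bags nest structurally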

data.parse

A small parser-combinator library.

lang.syntax

Describes the language:

  • tokens: keywords, literals, symbols, (, ), ...
  • AST structure as an inductive data type (on top of token bags?). Each node has a bag of attributes.

lang.parser

Uses data.parse to build an AST from a utf8 byte sequence.

lang.binary

An interface to serialize/deserialize bags+types.

lang.record

A tuple <a bag of data, data type>. Generic (serde-like) serialization/deserialization interface.

lang.struct

Mapping (binary serialization/deserialization implementation) of a record onto a linear buffer.

lang.func, lang.coro?

A func is a lambda that actually consumes/produces resources (stack frames, thread time, etc). Describes how lambda values become machine resources and run: stack frames, returns, calls, etc. Calling conventions?

A coro is an executable state machine.

Q: closures?

core.mem

Provides: the core.mem capability. Reading most (?) properties is safe (? but: ASLR). Destructive (i.e. write) changes are equivalent to having any process-wide capability. But: not every capability depends on core.mem (e.g. os.fs, os.user, net.ip.block are limits external to the process).

Objects:

  • address spaces: usually one, with its traits (32-bit, 64-bit, etc), but e.g. code space, data space for Harvard architecture.
  • memory segments: a set of $(offset, size)$ mappings with their own traits and allowed operations. Segments can grow (e.g. heap, stack). Can be executable, writable. Can be created/deleted (e.g. mmap/munmap, MMU operations). Q: are these just special, root memory regions?
  • memory regions (places?) (~ a resource that can hold a representation of a value, possibly complex one). Type encodings and transmutation, memory layout, alignment. Properties: offset, size, stride (may differ from size), read/write, volatile, executable, etc. Access patterns (random access, sequential access, etc).
  • references: a way to refer to a memory region for some purpose (e.g. reading, writing, managing, executing instructions). Operations: peek, poke. Explicitly shy away from pointers, especially as integers.

A remark about Rust: Rust does a very good job when relying on its set of assumptions (e.g. one address space, nonvolatile memory regions, internal ABI, pointers are integers), but is literally unsafe when stepping outside (e.g. lifetimes on JIT-compiled functions, memory segments going away underneath "safe" references, some other ABIs, NUMA, etc). Let's bring more clarity and explicitness here.
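A hypothetical peek/poke sketch through such a reference object (mem.refTo, addrspace.default, and the address literal are assumed names for illustration):

with (mem: core.mem) {
	# a typed reference into an address space; inherently unsafe:
	let r = do mem.refTo(addrspace.default, 0x1000, U32)
	let v = do r.peek()
	do r.poke(v + 1)
}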

core.alloc

Requires: the core.mem capability. Provides: the core.alloc capability, the negation of core.noalloc.

Objects:

  • allocators: objects that can create/resize/consume memory places as resources.

Allocators can be created from a memory region and recycled themselves by their memory segment or parent region (e.g. enabling munmap). Allocators should be explicit (at least on a module level) and nudge to allocate memory closer to callstack roots, unlike the global allocator in Rust.

Allocation is fallible: every allocation failure either must be handled or implies a possible panic.

Allocated places are bound by the lifetime of their allocator.

Allocators can be garbage-collecting, with no-op "freeing".