ScottsSecondAct/tiny-compiler

Tiny Compiler

License: MIT · AI Assisted

A fully functional compiler for a custom programming language, built from scratch using ANTLR4, C++17, and LLVM. This project spans the entire compilation pipeline — lexical analysis, parsing, AST construction, semantic analysis, LLVM IR code generation, and native executable creation — demonstrating systems programming skills and deep understanding of language implementation.

Why This Project

Compilers sit at the intersection of several demanding disciplines: formal language theory, type systems, memory management, tree transformations, and low-level code generation. Building one from scratch — rather than following a tutorial line-by-line — requires making real architectural decisions, debugging across abstraction layers, and reasoning about how high-level constructs map to machine execution.

This project was developed as a hands-on exercise in compiler construction, using AI assistance (Anthropic's Claude) as a collaborator for architecture design, scaffolding, and debugging — the same way a professional engineer uses documentation, Stack Overflow, or a senior colleague. Every line of code was reviewed, understood, and integrated by hand. The AI didn't write the compiler; it accelerated the learning.

The Tiny Language

Tiny is a statically-typed, imperative language designed to exercise the full ANTLR4 → C++ → LLVM pipeline. It's small enough to implement end-to-end, but rich enough to require real compiler engineering.

fn fibonacci(n: int) -> int {
    if (n <= 1) {
        return n;
    }
    return fibonacci(n - 1) + fibonacci(n - 2);
}

fn main() {
    for (i in 0..15) {
        print("fib(", i, ") = ", fibonacci(i));
    }
}

Closures and First-Class Functions

Functions are values in Tiny. You can store them in variables, pass them as arguments, and return them from other functions. Lambdas capture variables from their enclosing scope:

fn makeAdder(n: int) -> fn(int) -> int {
    return fn(x: int) -> int { return x + n; };
}

let add5 = makeAdder(5);
print("add5(3) = ", add5(3));    // 8

fn apply(f: fn(int) -> int, x: int) -> int {
    return f(x);
}
print(apply(add5, 10));          // 15

Language Features

  • Type system: int, float, bool, string, fixed-size arrays (int[3]), function types (fn(int) -> int)
  • First-class functions: Lambda expressions, closures with capture-by-value, higher-order functions
  • Modules: import "file.tiny"; — multi-file programs with deduplication and circular-import detection
  • Bindings: Immutable (let) and mutable (var) with optional type inference
  • Control flow: if/else, while, range-based for
  • Functions: Named functions with typed parameters, return types, and recursion
  • Expressions: Full arithmetic, comparison, and logical operators with correct precedence
  • Built-ins: print() with multiple arguments
  • Comments: Single-line (//) and block (/* */)

Architecture

The compiler follows a classical multi-pass architecture, with each phase cleanly separated behind a visitor interface:

 source.tiny
     │
     ▼
┌──────────────┐    ANTLR4 generates a lexer and parser from a formal
│  Lexer       │    grammar (Tiny.g4). The lexer tokenizes source text;
│  Parser      │    the parser builds a concrete syntax tree.
└──────┬───────┘
       │ parse tree
       ▼
┌──────────────┐    ASTBuilder walks ANTLR's parse tree and constructs
│  ASTBuilder  │    a clean, domain-specific AST decoupled from the
│              │    grammar. This is where syntax becomes structure.
└──────┬───────┘
       │ AST
       ▼
┌──────────────┐    ModuleLoader resolves import declarations by
│  Module      │    recursively parsing imported files, extracting their
│  Loader      │    functions, and prepending them to the main AST.
│              │    Handles deduplication and circular import detection.
└──────┬───────┘
       │ merged AST
       ▼
┌──────────────┐    SemanticAnalyzer enforces type safety, resolves
│  Semantic    │    symbols through a scoped symbol table, checks
│  Analysis    │    mutability, validates function signatures, and
│              │    performs capture analysis for closures.
└──────┬───────┘
       │ validated AST (with capture info)
       ▼
┌──────────────┐    CodeGen walks the AST and emits LLVM IR using
│  CodeGen     │    the LLVM C++ API — alloca/load/store for locals,
│  (LLVM IR)   │    basic blocks for control flow, GEP for arrays,
│              │    closure structs for first-class functions.
└──────┬───────┘
       │ output.ll
       ▼
┌──────────────┐    Clang compiles IR to native machine code, linked
│  clang       │    against a small C++ runtime that implements
│  + runtime   │    print() and other built-ins.
└──────────────┘
       │
       ▼
   executable

Technical Highlights

Grammar Design (ANTLR4)

The grammar uses precedence climbing for expressions, explicit keyword tokens to prevent identifier capture, and clean separation of lexer and parser rules. It supports lambda expressions as primary expressions and function types as first-class type annotations. ANTLR4 generates the lexer and parser as C++ classes with a visitor interface.
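The two grammar ideas can be sketched in ANTLR4 notation. This is an illustrative excerpt, not the project's actual Tiny.g4: in ANTLR4, left-recursive alternatives listed earlier bind tighter, and lexer rules declared before the identifier rule keep keywords from being swallowed as identifiers.

```antlr
// Hypothetical excerpt in the spirit of Tiny.g4 (rule names are illustrative).
expr
    : expr op=('*' | '/') expr   # MulDiv    // listed first: higher precedence
    | expr op=('+' | '-') expr   # AddSub
    | '(' expr ')'               # Parens
    | INT                        # Int
    ;

FN  : 'fn';                      // explicit keyword token, declared before ID
ID  : [a-zA-Z_] [a-zA-Z_0-9]*;   // so 'fn' never lexes as an identifier
INT : [0-9]+;
WS  : [ \t\r\n]+ -> skip;
```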

AST Independence

The AST is entirely decoupled from ANTLR's parse tree. ANTLR's concrete syntax tree mirrors the grammar 1:1 — every rule produces a node. The AST is a cleaner, semantic tree (BinaryExpr, IfStmt, FunctionDecl, LambdaExpr) that later passes can traverse without knowing anything about ANTLR. This required solving a non-trivial engineering problem: ANTLR visitors return std::any, which requires copy-constructible types, but AST nodes use unique_ptr for ownership. The solution uses a shared_ptr<Holder> wrapper that satisfies std::any's copy requirement while preserving unique ownership semantics.

Semantic Analysis & Capture Analysis

The semantic analyzer makes two passes over declarations: it first registers all function signatures, then analyzes bodies, which enables mutual recursion. For closures, it walks the lambda body to identify all variable references, subtracts parameters and local declarations, and records the remaining names as captures. It also supports calling variables with function types, not just named functions. It enforces type safety across all operators, validates mutability (let vs var), checks function call arity and argument types, and infers types from initializers.

Closure Implementation

Closures are represented at the LLVM level as a struct { i8* fn_ptr, i8* env_ptr }. When a lambda is created, the compiler heap-allocates an environment struct via malloc, copies captured variable values into it, and generates an internal LLVM function that takes the environment as a hidden first parameter. At the call site, the function pointer and environment are extracted from the struct and the call is dispatched through an indirect function pointer. This is the same strategy used by production compilers for languages like Swift and Rust.

Visitor Pattern Across Phases

Every compiler phase implements the same ASTVisitor interface. The AST printer, semantic analyzer, and code generator are all visitors — adding a new pass (optimization, linting, formatting) means writing one new class. This design mirrors production compilers.

LLVM Code Generation

The code generator maps Tiny constructs to LLVM IR using the C++ API:

Tiny Construct        LLVM Pattern
──────────────        ────────────
var x: int = 42       alloca → store → load
if / else             Basic blocks + conditional br
while, for..in        Loop headers, back-edges, br
fn add(a, b)          Function::Create + entry basic block
arr[i]                getelementptr (GEP)
print(...)            External @tiny_print_* calls
fn(x) { x + 1 }       Internal function + malloc env + closure struct
f(args)               Extract fn/env from closure, indirect call
import "math.tiny"    AST-level merge — imported FunctionDecl nodes prepended before analysis
-O2                   LLVM PassBuilder optimization pipeline
Source locations      DIBuilder — DICompileUnit, DISubprogram, DILocalVariable, DILocation
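For example, a local variable declaration lowers roughly as follows. This is hand-written, hypothetical IR (typed-pointer syntax shown for readability; value names and the exact runtime symbol may differ from the compiler's real output):

```llvm
; Roughly what `var x: int = 42; print(x);` becomes.
define i32 @main() {
entry:
  %x = alloca i32              ; stack slot for x
  store i32 42, i32* %x        ; run the initializer
  %0 = load i32, i32* %x       ; every read of x is a load
  call void @tiny_print_int(i32 %0)
  ret i32 0
}

declare void @tiny_print_int(i32)   ; provided by the C++ runtime
```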

Module System

Multi-file programs are supported via import "file.tiny"; at the top of any source file. The ModuleLoader resolves imports at the AST level before semantic analysis runs: it parses each imported file, recursively resolves its own imports, and prepends the resulting FunctionDecl nodes into the main program's declaration list. This means the semantic analyzer and code generator see a single, flat, fully-resolved program — no linker involvement required. The loader uses two sets to detect circular imports early (while the import chain is still being walked) and to deduplicate modules that are imported from multiple files.

// math.tiny
fn square(x: int) -> int { return x * x; }

// main.tiny
import "math.tiny";
print(square(7));   // 49

Source-Level Debugging (LLVM DIBuilder)

The compiler emits full DWARF debug information via LLVM's DIBuilder API, so compiled programs can be stepped through in GDB or LLDB at the .tiny source level — not just at the machine-code level. Every function gets a DISubprogram, every local variable a DILocalVariable with insertDeclare, every block a DILexicalBlock, and every statement a DILocation. The CodeGen constructor accepts the source file path and wires it into a DICompileUnit with DW_LANG_C. A critical subtlety: LLVM's IRBuilder retains the current debug location across SetInsertPoint, so the implementation explicitly clears it when entering each new function to prevent scope bleed.

# Compile with debug info (default — no extra flag needed)
./tinyc examples/fibonacci.tiny -o fib.ll
clang fib.ll runtime/runtime.cpp -o fib -no-pie

# Step through .tiny source in GDB
gdb fib
(gdb) b fib_rec
(gdb) run
# Breakpoint 1, fib_rec (n=0) at fibonacci.tiny:4
(gdb) list

Testing Strategy

End-to-end tests use .tiny / .expected file pairs — the test runner compiles each program, executes the binary, and diffs stdout against expected output. Error tests verify that invalid programs produce correct diagnostics. Unit tests cover individual components with GoogleTest.

Build

Prerequisites

  • CMake ≥ 3.20
  • C++17 compiler (GCC 9+ / Clang 10+)
  • ANTLR4 C++ runtime (4.13.2)
  • LLVM 15+ development libraries
  • Java runtime (for ANTLR4 tool — grammar regeneration only)
  • GoogleTest (for unit tests — optional, auto-detected)

Ubuntu / Debian

sudo apt install cmake ninja-build g++ clang default-jre \
    llvm-dev libffi-dev libedit-dev libncurses-dev zlib1g-dev libzstd-dev

Build & Run

cmake -B build
cmake --build build -j$(nproc)

# Compile a Tiny program to LLVM IR
./build/tinyc examples/hello.tiny -o output.ll

# With optimizations
./build/tinyc examples/hello.tiny -o output.ll -O2

# Link with runtime and run as native executable
clang output.ll runtime/runtime.cpp -o hello -no-pie
./hello

# Debug flags
./build/tinyc examples/hello.tiny --dump-tokens   # Print token stream
./build/tinyc examples/hello.tiny --dump-ast      # Print AST

Run Tests

# Integration tests — compiles and executes .tiny programs, diffs against .expected output
python3 tests/programs/run_tests.py --compiler ./build/tinyc --runtime runtime/runtime.cpp

# Unit tests — GoogleTest suite covering lexer, parser, semantic analyzer, AST primitives
./build/tests/tiny_tests

Project Structure

See PROJECT_STRUCTURE.md for the full directory layout, architecture diagram, and design decision rationale.

Development Process & AI Collaboration

This project was built incrementally using AI assistance as a learning accelerator:

  • Architecture and scaffolding: Claude helped design the directory structure, CMake build system, grammar, and AST node hierarchy — the kind of boilerplate that takes hours to get right but teaches little.
  • Debugging: When builds failed (ANTLR version mismatches, std::any copy semantics, LLVM CMake integration), Claude helped diagnose root causes from compiler output — a skill that transfers directly to real-world development.
  • Implementation: Each compiler phase was discussed, designed, and then coded with AI assistance. Every component was reviewed and understood before integration.

This mirrors professional software development, where engineers routinely use tools, references, and collaborators to build systems they fully understand. The learning is in the architecture decisions, the debugging, and the integration — not in memorizing boilerplate.

Skills Demonstrated

  • Systems programming: C++17 with move semantics, smart pointers, CRTP, virtual dispatch, and template metaprogramming
  • Compiler engineering: Lexing, parsing, AST design, type systems, scope resolution, capture analysis, code generation
  • LLVM: IR generation via the C++ API — basic blocks, alloca/load/store, GEP, function definitions, external linkage, closure structs, indirect calls, optimization passes, DIBuilder for DWARF debug info
  • Build systems: CMake with custom targets, external tool integration, multi-library linking
  • Software architecture: Clean separation of concerns, visitor pattern, test-driven development
  • Tool proficiency: ANTLR4, LLVM, GoogleTest, GDB/LLDB, CMake + Ninja, VS Code + WSL, Git

Roadmap

  • ANTLR4 grammar and C++ lexer/parser generation
  • Clean AST design with visitor pattern
  • AST builder (parse tree → AST)
  • AST pretty-printer for debugging
  • Scoped symbol table
  • Semantic analysis (type checking, mutability enforcement)
  • LLVM IR code generation
  • Runtime library (print, newline)
  • End-to-end compilation to native executables
  • Optimization passes via LLVM PassManager (-O1 / -O2 / -O3)
  • Closures and first-class functions (lambda expressions, capture analysis, heap-allocated environments)
  • LLVM debug info — DWARF via DIBuilder, GDB/LLDB source-level stepping through .tiny files
  • Module system — import "file.tiny"; with AST-level merging, deduplication, and circular import detection
  • Capture-by-reference for mutable closures
  • String operations (concatenation, length)

License

MIT
