A fully functional compiler for a custom programming language, built from scratch using ANTLR4, C++17, and LLVM. This project spans the entire compilation pipeline — lexical analysis, parsing, AST construction, semantic analysis, LLVM IR code generation, and native executable creation — demonstrating systems programming skills and deep understanding of language implementation.
Compilers sit at the intersection of several demanding disciplines: formal language theory, type systems, memory management, tree transformations, and low-level code generation. Building one from scratch — rather than following a tutorial line-by-line — requires making real architectural decisions, debugging across abstraction layers, and reasoning about how high-level constructs map to machine execution.
This project was developed as a hands-on exercise in compiler construction, using AI assistance (Anthropic's Claude) as a collaborator for architecture design, scaffolding, and debugging — the same way a professional engineer uses documentation, Stack Overflow, or a senior colleague. Every line of code was reviewed, understood, and integrated by hand. The AI didn't write the compiler; it accelerated the learning.
Tiny is a statically-typed, imperative language designed to exercise the full ANTLR4 → C++ → LLVM pipeline. It's small enough to implement end-to-end, but rich enough to require real compiler engineering.
fn fibonacci(n: int) -> int {
if (n <= 1) {
return n;
}
return fibonacci(n - 1) + fibonacci(n - 2);
}
fn main() {
for (i in 0..15) {
print("fib(", i, ") = ", fibonacci(i));
}
}
Functions are values in Tiny. You can store them in variables, pass them as arguments, and return them from other functions. Lambdas capture variables from their enclosing scope:
fn makeAdder(n: int) -> fn(int) -> int {
return fn(x: int) -> int { return x + n; };
}
let add5 = makeAdder(5);
print("add5(3) = ", add5(3)); // 8
fn apply(f: fn(int) -> int, x: int) -> int {
return f(x);
}
print(apply(add5, 10)); // 15
- Type system:
int,float,bool,string, fixed-size arrays (int[3]), function types (fn(int) -> int) - First-class functions: Lambda expressions, closures with capture-by-value, higher-order functions
- Modules:
import "file.tiny";— multi-file programs with deduplication and circular-import detection - Bindings: Immutable (
let) and mutable (var) with optional type inference - Control flow:
if/else,while, range-basedfor - Functions: Named functions with typed parameters, return types, and recursion
- Expressions: Full arithmetic, comparison, and logical operators with correct precedence
- Built-ins:
print()with multiple arguments - Comments: Single-line (
//) and block (/* */)
The compiler follows a classical multi-pass architecture, with each phase cleanly separated behind a visitor interface:
source.tiny
│
▼
┌─────────────┐ ANTLR4 generates a lexer and parser from a formal
│ Lexer │ grammar (Tiny.g4). The lexer tokenizes source text;
│ Parser │ the parser builds a concrete syntax tree.
└──────┬──────┘
│ parse tree
▼
┌─────────────┐ ASTBuilder walks ANTLR's parse tree and constructs
│ ASTBuilder │ a clean, domain-specific AST decoupled from the
│ │ grammar. This is where syntax becomes structure.
└──────┬──────┘
│ AST
▼
┌─────────────┐ ModuleLoader resolves import declarations by
│ Module │ recursively parsing imported files, extracting their
│ Loader │ functions, and prepending them to the main AST.
│ │ Handles deduplication and circular import detection.
└──────┬──────┘
│ merged AST
▼
┌─────────────┐ SemanticAnalyzer enforces type safety, resolves
│ Semantic │ symbols through a scoped symbol table, checks
│ Analysis │ mutability, validates function signatures, and
│ │ performs capture analysis for closures.
└──────┬──────┘
│ validated AST (with capture info)
▼
┌─────────────┐ CodeGen walks the AST and emits LLVM IR using
│ CodeGen │ the LLVM C++ API — alloca/load/store for locals,
│ (LLVM IR) │ basic blocks for control flow, GEP for arrays,
│ │ closure structs for first-class functions.
└──────┬──────┘
│ output.ll
▼
┌─────────────┐ Clang compiles IR to native machine code, linked
│ clang │ against a small C++ runtime that implements
│ + runtime │ print() and other built-ins.
└─────────────┘
│
▼
executable
The grammar uses precedence climbing for expressions, explicit keyword tokens to prevent identifier capture, and clean separation of lexer and parser rules. It supports lambda expressions as primary expressions and function types as first-class type annotations. ANTLR4 generates the lexer and parser as C++ classes with a visitor interface.
The AST is entirely decoupled from ANTLR's parse tree. ANTLR's concrete syntax tree mirrors the grammar 1:1 — every rule produces a node. The AST is a cleaner, semantic tree (BinaryExpr, IfStmt, FunctionDecl, LambdaExpr) that later passes can traverse without knowing anything about ANTLR. This required solving a non-trivial engineering problem: ANTLR visitors return std::any, which requires copy-constructible types, but AST nodes use unique_ptr for ownership. The solution uses a shared_ptr<Holder> wrapper that satisfies std::any's copy requirement while preserving unique ownership semantics.
The semantic analyzer performs a two-pass approach over declarations — first registering all function signatures, then analyzing bodies — enabling mutual recursion. For closures, it walks the lambda body to identify all variable references, subtracts parameters and local declarations, and records the remaining names as captures. It also supports calling variables with function types, not just named functions. It enforces type safety across all operators, validates mutability (let vs var), checks function call arity and argument types, and infers types from initializers.
Closures are represented at the LLVM level as a struct { i8* fn_ptr, i8* env_ptr }. When a lambda is created, the compiler heap-allocates an environment struct via malloc, copies captured variable values into it, and generates an internal LLVM function that takes the environment as a hidden first parameter. At the call site, the function pointer and environment are extracted from the struct and the call is dispatched through an indirect function pointer. This is the same strategy used by production compilers for languages like Swift and Rust.
Every compiler phase implements the same ASTVisitor interface. The AST printer, semantic analyzer, and code generator are all visitors — adding a new pass (optimization, linting, formatting) means writing one new class. This design mirrors production compilers.
The code generator maps Tiny constructs to LLVM IR using the C++ API:
| Tiny Construct | LLVM Pattern |
|---|---|
var x: int = 42 |
alloca → store → load |
if / else |
Basic blocks + conditional br |
while, for..in |
Loop headers, back-edges, br |
fn add(a, b) |
Function::Create + entry basic block |
arr[i] |
getelementptr (GEP) |
print(...) |
External @tiny_print_* calls |
fn(x) { x + 1 } |
Internal function + malloc env + closure struct |
f(args) |
Extract fn/env from closure, indirect call |
import "math.tiny" |
AST-level merge — imported FunctionDecl nodes prepended before analysis |
-O2 |
LLVM PassBuilder optimization pipeline |
| Source locations | DIBuilder — DICompileUnit, DISubprogram, DILocalVariable, DILocation |
Multi-file programs are supported via import "file.tiny"; at the top of any source file. The ModuleLoader resolves imports at the AST level before semantic analysis runs: it parses each imported file, recursively resolves its own imports, and prepends the resulting FunctionDecl nodes into the main program's declaration list. This means the semantic analyzer and code generator see a single, flat, fully-resolved program — no linker involvement required. The loader uses two sets to detect circular imports early (while the import chain is still being walked) and to deduplicate modules that are imported from multiple files.
// math.tiny
fn square(x: int) -> int { return x * x; }
// main.tiny
import "math.tiny";
print(square(7)); // 49
The compiler emits full DWARF debug information via LLVM's DIBuilder API, so compiled programs can be stepped through in GDB or LLDB at the .tiny source level — not just at the machine-code level. Every function gets a DISubprogram, every local variable a DILocalVariable with insertDeclare, every block a DILexicalBlock, and every statement a DILocation. The CodeGen constructor accepts the source file path and wires it into a DICompileUnit with DW_LANG_C. A critical subtlety: LLVM's IRBuilder retains the current debug location across SetInsertPoint, so the implementation explicitly clears it when entering each new function to prevent scope bleed.
# Compile with debug info (default — no extra flag needed)
./tinyc examples/fibonacci.tiny -o fib.ll
clang fib.ll runtime/runtime.cpp -o fib -no-pie
# Step through .tiny source in GDB
gdb fib
(gdb) b fib_rec
(gdb) run
# Breakpoint 1, fib_rec (n=0) at fibonacci.tiny:4
(gdb) listEnd-to-end tests use .tiny / .expected file pairs — the test runner compiles each program, executes the binary, and diffs stdout against expected output. Error tests verify that invalid programs produce correct diagnostics. Unit tests cover individual components with GoogleTest.
- CMake ≥ 3.20
- C++17 compiler (GCC 9+ / Clang 10+)
- ANTLR4 C++ runtime (4.13.2)
- LLVM 15+ development libraries
- Java runtime (for ANTLR4 tool — grammar regeneration only)
- GoogleTest (for unit tests — optional, auto-detected)
sudo apt install cmake ninja-build g++ clang default-jre \
llvm-dev libffi-dev libedit-dev libncurses-dev zlib1g-dev libzstd-devcmake -B build
cmake --build build -j$(nproc)
# Compile a Tiny program to LLVM IR
./build/tinyc examples/hello.tiny -o output.ll
# With optimizations
./build/tinyc examples/hello.tiny -o output.ll -O2
# Link with runtime and run as native executable
clang output.ll runtime/runtime.cpp -o hello -no-pie
./hello
# Debug flags
./build/tinyc examples/hello.tiny --dump-tokens # Print token stream
./build/tinyc examples/hello.tiny --dump-ast # Print AST# Integration tests — compiles and executes .tiny programs, diffs against .expected output
python3 tests/programs/run_tests.py --compiler ./build/tinyc --runtime runtime/runtime.cpp
# Unit tests — GoogleTest suite covering lexer, parser, semantic analyzer, AST primitives
./build/tests/tiny_testsSee PROJECT_STRUCTURE.md for the full directory layout, architecture diagram, and design decision rationale.
This project was built incrementally using AI assistance as a learning accelerator:
- Architecture and scaffolding: Claude helped design the directory structure, CMake build system, grammar, and AST node hierarchy — the kind of boilerplate that takes hours to get right but teaches little.
- Debugging: When builds failed (ANTLR version mismatches,
std::anycopy semantics, LLVM CMake integration), Claude helped diagnose root causes from compiler output — a skill that transfers directly to real-world development. - Implementation: Each compiler phase was discussed, designed, and then coded with AI assistance. Every component was reviewed and understood before integration.
This mirrors professional software development, where engineers routinely use tools, references, and collaborators to build systems they fully understand. The learning is in the architecture decisions, the debugging, and the integration — not in memorizing boilerplate.
- Systems programming: C++17 with move semantics, smart pointers, CRTP, virtual dispatch, and template metaprogramming
- Compiler engineering: Lexing, parsing, AST design, type systems, scope resolution, capture analysis, code generation
- LLVM: IR generation via the C++ API — basic blocks, alloca/load/store, GEP, function definitions, external linkage, closure structs, indirect calls, optimization passes,
DIBuilderfor DWARF debug info - Build systems: CMake with custom targets, external tool integration, multi-library linking
- Software architecture: Clean separation of concerns, visitor pattern, test-driven development
- Tool proficiency: ANTLR4, LLVM, GoogleTest, GDB/LLDB, CMake + Ninja, VS Code + WSL, Git
- ANTLR4 grammar and C++ lexer/parser generation
- Clean AST design with visitor pattern
- AST builder (parse tree → AST)
- AST pretty-printer for debugging
- Scoped symbol table
- Semantic analysis (type checking, mutability enforcement)
- LLVM IR code generation
- Runtime library (print, newline)
- End-to-end compilation to native executables
- Optimization passes via LLVM PassManager (
-O1/-O2/-O3) - Closures and first-class functions (lambda expressions, capture analysis, heap-allocated environments)
- LLVM debug info — DWARF via
DIBuilder, GDB/LLDB source-level stepping through.tinyfiles - Module system —
import "file.tiny";with AST-level merging, deduplication, and circular import detection - Capture-by-reference for mutable closures
- String operations (concatenation, length)
MIT