Large Language Models are revolutionizing code completion and vulnerability detection, but they share a critical bottleneck: context limits and noise. You cannot feed an entire repository into an LLM, and naive text-retrieval (RAG) often pulls in irrelevant code that confuses the model.
CodeGraphene solves this through Semantic Compression. It parses codebases into mathematical Code Property Graphs (CPGs), slices out only the structurally and semantically relevant code, and serializes it into LLM-ready prompts.
- Semantic Compression: Reduce LLM prompt sizes while retaining critical structural context.
- Variable Granularity: Analyze code at the LINE, METHOD, or FILE level natively.
- Highly Modular Architecture: Hot-swap Parsers, Trimmers, and Serializers to run ablation studies in minutes.
- Parser Agnostic: Currently powered by Joern, with an architecture built to easily support Tree-sitter or other AST extractors in the future.
CodeGraphene is built around a simple, 3-stage pipeline: Parse -> Trim -> Serialize.
from codegraphene import GraphPipeline, NodeGranularity
from codegraphene.parsers.joern import JoernParser
from codegraphene.trimmers.khop import KHopTrimmer
from codegraphene.serializers.text import CodeReconstructionSerializer
# 1. Configure the modular pipeline
pipeline = GraphPipeline(
    parser=JoernParser(granularity=NodeGranularity.LINE),
    trimmer=KHopTrimmer(hops=1),  # Extract a 1-hop structural neighborhood
    serializer=CodeReconstructionSerializer(granularity=NodeGranularity.LINE),
)
# 2. Extract highly-compressed context for a specific target
# (e.g., finding the context around line 30 in a target file)
llm_prompt = pipeline.run(file_path="examples/sample_code.py", target=30)
print(llm_prompt)

The best way to understand CodeGraphene is to see it in action. We have provided a suite of interactive Jupyter Notebooks in the /examples directory to walk you through the framework.
00_quickstart_pipeline.ipynb
The "Hello World" of CodeGraphene. Learn how the Parser, Trimmer, and Serializer fit together to compress a single Python file.

01_granularities.ipynb

Discover how swapping the NodeGranularity config (LINE, METHOD, FILE) changes the shape of the graph and the resulting LLM prompt.

03_exploring_raw_cpgs.ipynb
Dive under the hood to analyze the raw Code Property Graph output generated by Joern before CodeGraphene filters it.
CodeGraphene relies on Joern for CPG extraction. Because Joern is highly optimized for Unix environments, we strongly recommend running CodeGraphene on Linux or Windows Subsystem for Linux (WSL 2).
Ensure you have Java (JDK 11 or 17) and unzip installed, then run the official Joern installer:
curl -L "https://github.com/joernio/joern/releases/latest/download/joern-install.sh" | sh

Make sure the directory containing the joern executable is accessible in your system PATH (usually ~/bin).
Clone the repository and install it in editable mode:
git clone https://github.com/stg-tud/CodeGraphene.git
cd CodeGraphene
python3 -m venv .venv
source .venv/bin/activate
pip install -e .

CodeGraphene enforces strict decoupling to allow researchers to easily experiment with graph-augmented LLM strategies:
- Parsers: Ingest raw code and convert it into a standardized CodeGraph (a NetworkX MultiDiGraph). They handle granularity collapsing (e.g., merging AST tokens into LINE or METHOD nodes).
- Trimmers: Accept a massive graph and a target node, returning a minimal, context-rich subgraph.
- Serializers: Convert the optimized subgraph into a format the LLM can understand, such as reconstructed sequential code text.
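To make the Trimmer contract concrete, the k-hop neighborhood extraction that a trimmer like KHopTrimmer performs can be sketched as a breadth-first expansion. The function and toy graph below are hypothetical illustrations: they operate on a plain adjacency dict rather than CodeGraphene's actual NetworkX MultiDiGraph, and do not reflect the library's real internals.

```python
from collections import deque


def k_hop_subgraph(adjacency, target, hops):
    """Return all nodes reachable from `target` within `hops` edges.

    `adjacency` maps each node to the nodes its outgoing edges point to.
    A real CPG trimmer would also carry edge labels (AST/CFG/DFG) along.
    """
    seen = {target}
    frontier = deque([(target, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == hops:
            continue  # Do not expand past the hop budget
        for neighbor in adjacency.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, depth + 1))
    return seen


# Toy graph: nodes are line numbers, edges are flow relations (made up)
graph = {10: [11, 30], 11: [12], 30: [31], 31: [32]}
print(sorted(k_hop_subgraph(graph, target=30, hops=1)))  # [30, 31]
```

A 1-hop budget keeps only the immediate structural neighbors of the target, which is how the subgraph stays small enough to serialize into a compact LLM prompt.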
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.