This repository contains the experimental codebase, datasets, and logs for our mechanistic interpretability study on Transformer Semantics and Dynamic Computational Pathways.
The code evaluates two major hypotheses:
- The Forward Proportion Hypothesis: Semantic behavior is conditioned on the activation trajectory rather than solely the current state.
- The Dynamic Computational Pathway Hypothesis: Meaning emerges dynamically through context-sensitive interaction topologies between attention heads.
Our findings falsify the hypothesis that attention heads rewire dynamically and instead support the Combinatorial Recruitment Model, in which fixed-behavior attention heads communicate indirectly via a shared residual stream and contextual semantics emerge from the combinatorial selection of active components. Furthermore, we isolate an irreducible 27.3% of the variance in output divergence that cannot be explained by single-position state similarity, which points to limits in state-based semantic prediction.
- Co-Activation Communities: Attention heads form strong, stable communities (modularity $Q=0.536$) that distinctly separate symbolic tasks (coding/math) from natural language.
- Topology Predicts Semantics: Interaction graph topology carries independent predictive power, improving semantic category prediction to 91.7% (vs 88.3% using activation magnitude alone).
- No Dynamic Rewiring: Individual heads do not dynamically rewire across contexts (similarity ratio $\approx 1.002$). They exhibit stereotyped, rigid interaction patterns.
- Direct Pathways are Insignificant: Targeted disruption of direct head-to-head attention yields only 3.5% of the effect of full pair ablation.
- The 27.3% Mystery: Even using unembedding-projected distance metrics, 27.3% of the semantic divergence between convergent prompt trajectories remains unpredictable from single-position similarity, highlighting the complex nature of distributed, cross-position nonlinear processing.
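The modularity score reported above can be made concrete with a small example. The sketch below is not taken from the repository: it computes Newman's modularity $Q$ for a hand-built toy graph of six hypothetical heads, and the function name `modularity`, the edge weights, and the partition labels are all illustrative assumptions. The study itself uses Louvain community detection, which searches for a partition that maximizes this quantity.

```python
from collections import defaultdict


def modularity(edges, community_of):
    """Newman modularity Q for an undirected weighted graph without self-loops.

    edges: iterable of (u, v, weight) tuples with u != v
    community_of: dict mapping each node to a community label
    """
    total_weight = 0.0               # m: sum of all edge weights
    internal = defaultdict(float)    # edge weight inside each community
    degree_sum = defaultdict(float)  # summed weighted degree per community

    for u, v, w in edges:
        total_weight += w
        degree_sum[community_of[u]] += w
        degree_sum[community_of[v]] += w
        if community_of[u] == community_of[v]:
            internal[community_of[u]] += w

    # Q = sum over communities c of [ L_c / m - (d_c / 2m)^2 ]
    return sum(
        internal[c] / total_weight - (degree_sum[c] / (2.0 * total_weight)) ** 2
        for c in degree_sum
    )


# Toy co-activation graph (hypothetical): two triangles joined by one bridge edge.
toy_edges = [
    (0, 1, 1.0), (1, 2, 1.0), (0, 2, 1.0),
    (3, 4, 1.0), (4, 5, 1.0), (3, 5, 1.0),
    (2, 3, 1.0),
]
toy_partition = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}

q = modularity(toy_edges, toy_partition)
print(round(q, 4))
```

On this toy graph the two-triangle partition gives $Q = 5/14 \approx 0.357$, while placing every node in a single community gives $Q = 0$. Higher values indicate that more edge weight falls inside communities than a degree-preserving random graph would predict.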
The research pipeline is divided into multi-phase iterations. The code is modular and built on top of TransformerLens.
- code/phase0_setup.py: Core utility functions, GPT-2-small loading, logger initialization, and metric definitions (Cosine, KL Divergence, Logit Lens).
- code/iter1_...: Scripts handling the initial dataset creation, activation fingerprinting, community detection (Louvain), activation patching for the Forward Proportion test, and the falsification of Semantic Rings.
- code/iter2_phase1_interaction_graphs.py: Constructs $144 \times 144$ interaction matrices based on attention-weighted upstream activations.
- code/iter2_phase2_pathway_prediction.py: Trains Logistic Regression classifiers (with 5-fold CV) comparing Activation Magnitude vs Interaction Topology.
- code/iter2_phase3_context_sensitivity.py: Tests whether target heads alter their interaction topologies based on the semantic context.
- code/iter2_phase4_pathway_perturbation.py: Performs causal interventions (Pair Ablation vs Pathway Disruption) to test the causal necessity of direct $A \rightarrow B$ circuits.
- code/the_final_answer.py (The 42% Experiment): The definitive test resolving the unexplained variance. Compares Baseline Cosine, Post-LayerNorm Cosine, and Unembed-projected distance, and performs SVD directional decomposition of $W_U$.
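To illustrate why the scripts above compare raw cosine similarity with an unembed-projected comparison, here is a minimal, dependency-free sketch of those metrics. It is not the repository's implementation: the helper names (`cosine_similarity`, `softmax`, `kl_divergence`, `unembed`), the 3-dimensional states, and the tiny `toy_w_u` matrix are assumptions chosen by hand so that two nearly parallel states disagree along the one direction the unembedding reads from.

```python
import math


def cosine_similarity(x, y):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) in nats; assumes q is nonzero wherever p is nonzero."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def unembed(state, w_u):
    """Project a d_model vector to vocabulary logits: logits = state @ W_U."""
    vocab_size = len(w_u[0])
    return [
        sum(state[i] * w_u[i][j] for i in range(len(state)))
        for j in range(vocab_size)
    ]


# Hypothetical toy setup: d_model = 3, vocabulary of 2 tokens.
# Only the third residual dimension influences the output logits.
toy_w_u = [
    [0.0, 0.0],
    [0.0, 0.0],
    [5.0, -5.0],
]
state_a = [1.0, 1.0, 0.1]
state_b = [1.0, 1.0, -0.1]

state_cosine = cosine_similarity(state_a, state_b)
output_kl = kl_divergence(
    softmax(unembed(state_a, toy_w_u)),
    softmax(unembed(state_b, toy_w_u)),
)
print(state_cosine, output_kl)
```

In this constructed case the two states have cosine similarity above 0.99, yet their output distributions differ noticeably, because the small difference lies along a direction that `toy_w_u` amplifies. Real residual streams and the real $W_U$ are far higher-dimensional, so this only shows the mechanism, not the magnitude of the effect reported in the study.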
- Clone the repository:

      git clone https://github.com/yourusername/forward-proportion.git
      cd forward-proportion

- Install dependencies: This project requires Python 3.8+ and PyTorch.

      pip install torch transformer-lens networkx scikit-learn scipy python-louvain matplotlib

- Running the experiments: The codebase was originally designed for execution in Google Colab (T4 GPU recommended). You can run the scripts sequentially in a Jupyter Notebook environment or directly via Python:

      python code/phase0_setup.py
      python code/iter2_phase1_interaction_graphs.py
      # ...
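If you prefer to drive the sequence from a single Python entry point, a small runner along the following lines may help. This is a sketch and not part of the repository: `run_in_order` and `SCRIPTS` are hypothetical names, and the list contains only the two scripts named in the commands above, since the remaining entries are elided there and are deliberately not guessed here.

```python
import subprocess
import sys


def run_in_order(script_paths, python=sys.executable):
    """Run each script with the given interpreter, stopping at the first failure.

    Returns the list of scripts that exited with return code 0.
    """
    completed = []
    for path in script_paths:
        result = subprocess.run([python, path])
        if result.returncode != 0:
            print(f"stopped: {path} exited with code {result.returncode}")
            break
        completed.append(path)
    return completed


# Only the scripts explicitly named in the run commands; extend this list
# yourself with the later phases in the order you want them executed.
SCRIPTS = [
    "code/phase0_setup.py",
    "code/iter2_phase1_interaction_graphs.py",
]
```

Calling `run_in_order(SCRIPTS)` from the repository root runs the listed phases one after another. Stopping at the first non-zero exit code avoids launching a later phase when an earlier one failed, which matters if later phases rely on outputs of earlier ones.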
The full theoretical framework, mathematical formulations, and detailed analysis can be found in the accompanying LaTeX manuscript located in Iteration One/research_paper.tex.
(Placeholder for Zenodo / arXiv link once published)
This project is open-sourced under the MIT License. See the LICENSE file for details.