Exploring the Inner Workings of Large Language Models

🎓 Open to Opportunities: I am currently seeking PhD positions and research internships in the field of Machine Learning and AI, particularly focusing on LLM interpretability and mechanistic understanding.

🔍 Current Research: Mechanistic Interpretability in LLMs

I am a second-year master’s student, currently pursuing my thesis under Prof. Matthias Bethge and Dr. Çağatay Yıldiz at the Bethge Lab. My research delves into the inner workings of large language models, developing novel approaches to quantify and track knowledge acquisition in these complex systems through mechanistic interpretability.

🧪 Current Research Directions

Our research extends Geva et al.’s (2023) foundational work on factual associations in auto-regressive models, developing more sophisticated measurement techniques. A critical insight emerged from Öncel et al.’s (2024) findings that traditional perplexity metrics can be deceptive – lower perplexity scores don’t always indicate true domain understanding.
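
To make the perplexity caveat concrete, here is a minimal sketch of how the metric is typically computed for a causal language model. The model name and the sample text are placeholders, not part of our pipeline; the point is simply that perplexity rewards good next-token prediction, which is not the same as holding the underlying domain knowledge.

```python
# Minimal perplexity sketch (assumption: Hugging Face transformers with GPT-2;
# swap in any causal LM and your own domain text).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

text = "The mitochondrion is the powerhouse of the cell."  # placeholder domain text

inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    # With labels supplied, the model returns the mean token-level cross-entropy.
    loss = model(**inputs, labels=inputs["input_ids"]).loss

perplexity = torch.exp(loss).item()
# Low values mean good next-token prediction, not necessarily true domain understanding.
print(f"perplexity = {perplexity:.2f}")
```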

I’m currently advancing research along two primary axes:

  1. Knowledge Measurement Pipeline: Developing transferable methodologies for measuring domain-specific knowledge across model layers, building on “Dissecting Recall of Factual Associations in Auto-Regressive Language Models” (see the layer-wise readout sketch after this list).

  2. Domain-Specific Pre-training: Investigating how specialized pre-training shapes model capabilities, extending insights from “Investigating Continual Pretraining in Large Language Models”.
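
To give a flavour of what “measuring knowledge across model layers” can look like, the sketch below is a simplified logit-lens-style readout, not the thesis pipeline itself. It assumes a GPT-2 checkpoint from Hugging Face, and the factual prompt and target token are purely illustrative; it asks at which layer the correct attribute token rises to the top of the model’s prediction.

```python
# Layer-wise readout sketch (assumptions: GPT-2 via Hugging Face transformers;
# prompt and target are illustrative; this is a simplified logit-lens probe).
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

prompt = "The Eiffel Tower is located in the city of"  # placeholder factual prompt
target = " Paris"                                      # expected attribute token
target_id = tokenizer.encode(target)[0]

inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# hidden_states: (embeddings, layer 1, ..., layer 12), each of shape [1, seq, d_model]
for layer, h in enumerate(out.hidden_states):
    # Project the last-position hidden state through the final norm + unembedding.
    logits = model.lm_head(model.transformer.ln_f(h[0, -1]))
    rank = (logits > logits[target_id]).sum().item() + 1
    print(f"layer {layer:2d}: rank of '{target.strip()}' = {rank}")
```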

A key innovation in our approach is the application of activation engineering through steering vectors – essentially creating interpretable maps of knowledge representation across model layers. This builds on Arditi et al.’s (2024) demonstration that LLMs encode features as linear directions in their activation space, complemented by Merullo et al.’s (2024) work on vector arithmetic in language models.
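
As an illustration of the general recipe rather than the exact method used in my thesis, the sketch below computes a steering vector as the difference of mean residual-stream activations between two contrastive prompt sets and adds it back at inference time via a forward hook. The model (GPT-2 via Hugging Face transformers), the prompt lists, and the chosen layer are all placeholder assumptions.

```python
# Steering-vector sketch (assumptions: GPT-2 via Hugging Face transformers,
# toy contrastive prompt lists, and a hand-picked layer; illustrative only).
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
LAYER = 6  # illustrative residual-stream layer

def mean_activation(prompts, layer):
    """Mean last-token hidden state at `layer` over a list of prompts."""
    acts = []
    for p in prompts:
        ids = tokenizer(p, return_tensors="pt")
        with torch.no_grad():
            h = model(**ids, output_hidden_states=True).hidden_states[layer]
        acts.append(h[0, -1])
    return torch.stack(acts).mean(dim=0)

# Contrastive prompt sets (placeholders for e.g. domain-specific vs. generic text).
pos = ["The enzyme catalyzes", "The protein binds to"]
neg = ["The weather today is", "My favourite movie is"]
steer = mean_activation(pos, LAYER) - mean_activation(neg, LAYER)

def add_steering(module, inputs, output):
    # GPT-2 blocks return a tuple; output[0] holds the hidden states.
    return (output[0] + steer,) + output[1:]

# Hook the block whose output corresponds to hidden_states[LAYER].
handle = model.transformer.h[LAYER - 1].register_forward_hook(add_steering)
out = model.generate(**tokenizer("The study of", return_tensors="pt"),
                     max_new_tokens=20)
handle.remove()
print(tokenizer.decode(out[0]))
```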

💡 Why Mechanistic Interpretability?

Richard Feynman’s principle, “What I cannot create, I do not understand,” perfectly encapsulates the mission of mechanistic interpretability. The field has achieved remarkable breakthroughs in demystifying neural networks, from Elhage et al.’s (2021) mathematical frameworks for transformer circuits to Elhage et al.’s (2022) understanding of superposition phenomena. Recent advances in automated circuit discovery (Conmy et al., 2023) and interpretable circuits in GPT-2 (Wang et al., 2022) demonstrate that systematic reverse engineering of these complex systems is not just possible – it’s revolutionizing our understanding.

🔬 Practical Applications

These theoretical advances have enabled significant practical progress, from localizing factual knowledge within model components to steering model behavior directly in activation space.

🎯 Research Focus

My research explores the crucial intersection of knowledge representation and model behavior. By combining activation engineering techniques with causal intervention methods, I investigate how models process and adapt domain-specific knowledge while maintaining their core capabilities. This work is particularly inspired by knowledge localization techniques (Meng et al., 2023), factual association tracking (Geva et al., 2023), and advances in model steering (Arditi et al., 2024).
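
As a concrete, if simplified, illustration of what a causal intervention looks like in this setting, the sketch below caches a hidden state from a “clean” prompt and patches it into a run on a “corrupted” prompt, then checks how much of the original prediction is restored. It is a toy version of the causal-tracing idea cited above, with GPT-2, the prompts, and the patched layer and position all chosen purely for illustration.

```python
# Activation-patching sketch (assumptions: GPT-2 via Hugging Face transformers,
# placeholder prompts, and a single hand-picked layer/position; a toy version
# of causal tracing, not a full reproduction of the cited work).
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
LAYER, POS = 6, -1  # illustrative layer index and token position to patch

clean = "The Eiffel Tower is located in the city of"   # placeholder clean prompt
corrupt = "The Colosseum is located in the city of"    # placeholder corrupted prompt
target_id = tokenizer.encode(" Paris")[0]

def logit_of_target(prompt, patch=None):
    """Return the target-token logit, optionally patching a cached hidden state."""
    handle = None
    if patch is not None:
        def hook(module, inputs, output):
            hidden = output[0].clone()
            hidden[0, POS] = patch  # overwrite one position with the cached state
            return (hidden,) + output[1:]
        handle = model.transformer.h[LAYER].register_forward_hook(hook)
    ids = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**ids, output_hidden_states=True)
    if handle is not None:
        handle.remove()
    # hidden_states[LAYER + 1] is the output of the hooked block.
    return out.logits[0, -1, target_id].item(), out.hidden_states[LAYER + 1][0, POS]

clean_logit, cached = logit_of_target(clean)
corrupt_logit, _ = logit_of_target(corrupt)
patched_logit, _ = logit_of_target(corrupt, patch=cached)
print(f"clean {clean_logit:.2f} | corrupt {corrupt_logit:.2f} | patched {patched_logit:.2f}")
```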

The most compelling aspect of this work is its potential to transform theoretical insights into practical applications, advancing both our understanding and our ability to create more reliable, controllable AI systems. Beyond the papers mentioned above, I draw significant inspiration from many other groundbreaking works in the field.