Mechanistic Interpretability in Large Language Models

🔍 Research Focus: Understanding the Inner Workings of LLMs

I am a Research Assistant at the University of Tübingen, working with Dr. Thomas Wolfers and Dr. Çağatay Yıldız on the mechanistic interpretability of large language models. My research centers on developing novel approaches to understand how these complex systems acquire, represent, and access knowledge.

🧪 Current Research

My work focuses on two complementary areas:

Knowledge Measurement & Evaluation: Developing contamination-free evaluation frameworks for domain-specific knowledge in LLMs, moving beyond raw perplexity toward measures of genuine domain understanding (a toy perplexity probe is sketched after this list).

Activation Engineering: Investigating how domain knowledge emerges as targetable directions in model activation space, enabling systematic control without traditional fine-tuning approaches.
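As a toy illustration of the perplexity baseline that the evaluation work extends, the sketch below scores a causal LM's familiarity with a piece of text. It is not the contamination-free pipeline itself (which additionally controls for training-data overlap); the model name and example sentences are placeholders.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # placeholder; substitute any causal LM

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME).eval()

def perplexity(text: str) -> float:
    """Perplexity of `text` under the model (lower = more familiar)."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # Passing labels=ids makes the model return the mean
        # next-token cross-entropy over the sequence.
        loss = model(ids, labels=ids).loss
    return torch.exp(loss).item()

# Text from a domain the model has seen should score lower
# than text from an unfamiliar one.
print(perplexity("The cat sat on the mat."))
print(perplexity("Eigenvalues of the Hessian govern loss curvature."))
```

The limitation motivating the contamination-free framing: a low score here can reflect memorized training text rather than understanding, which is exactly what held-out, contamination-checked evaluation sets are meant to rule out.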

💡 Research Philosophy

Following Richard Feynman’s principle “What I cannot create, I do not understand,” my research aims to reverse-engineer the internal mechanisms of language models. By understanding how these systems process and represent knowledge, we can build more reliable, controllable, and interpretable AI systems.

🔬 Key Contributions

  • Domain-Specific Evaluation: Created deterministic pipelines for contamination-free LLM evaluation using large-scale datasets (arXiv: 1.56M documents, M2D2: 8.5B tokens)
  • Continual Learning: Investigated how model size affects knowledge acquisition and retention during continual pretraining across diverse domains
  • Activation Engineering: Developed techniques to access latent knowledge through steering vectors and activation patterns (a minimal steering sketch follows this list)
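A minimal sketch of what activation steering looks like in practice, assuming a GPT-2-style model from Hugging Face transformers. The random direction below stands in for a learned domain direction, and the layer index and scale are illustrative choices, not values from this research.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME).eval()

layer = model.transformer.h[6]             # a middle residual block
steer = torch.randn(model.config.n_embd)   # stand-in for a real learned direction
steer = 4.0 * steer / steer.norm()         # scale controls intervention strength

def add_direction(module, inputs, output):
    # The block's output tuple starts with the residual-stream
    # hidden states; add the steering vector at every token position.
    hidden = output[0] + steer
    return (hidden,) + output[1:]

handle = layer.register_forward_hook(add_direction)
enc = tokenizer("The weather today is", return_tensors="pt")
out = model.generate(**enc, max_new_tokens=20, do_sample=False)
handle.remove()
print(tokenizer.decode(out[0], skip_special_tokens=True))
```

In the actual research setting, the direction would be derived from activation differences between domains rather than sampled at random; the hook mechanism, however, is the standard way to apply such an intervention without any fine-tuning.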

🎯 Impact & Applications

This research supports practical advances in:

  • Model Efficiency: Targeted knowledge editing without full retraining
  • Safety & Control: Direct model steering through activation engineering
  • Robust Evaluation: Better understanding of what models truly know versus what they have merely memorized
  • Knowledge Transfer: Optimizing how models adapt to new domains

Research conducted at the Bethge Lab and the Vernade Lab, in collaboration with the Mental Health Mapping Lab, University of Tübingen.