Exploring the Inner Workings of Large Language Models

🎓 Open to Opportunities: I am currently seeking PhD positions and research internships in the field of Machine Learning and AI, particularly focusing on LLM interpretability and mechanistic understanding.

🔍 Current Research: Mechanistic Interpretability in LLMs

I am a second-year master’s student, currently pursuing my thesis under Prof. Matthias Bethge and Dr. Çağatay Yıldiz at the Bethge Lab. My research delves into the inner workings of large language models, developing novel approaches to quantify and track knowledge acquisition in these complex systems through mechanistic interpretability.

🧪 Current Research Directions

Our research extends Geva et al.’s (2023) foundational work on factual associations in auto-regressive models, developing more sophisticated measurement techniques. A critical insight emerged from Öncel et al.’s (2024) findings that traditional perplexity metrics can be deceptive – lower perplexity scores don’t always indicate true domain understanding.
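
To make the perplexity caveat concrete, here is a minimal sketch of how the metric is typically computed for a causal language model. The model name and the sample text are placeholders, not part of our pipeline; the point is simply that perplexity rewards good next-token prediction, which is not the same as holding the underlying domain knowledge.

```python
# Minimal perplexity sketch (assumption: Hugging Face transformers with GPT-2;
# swap in any causal LM and your own domain text).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

text = "The mitochondrion is the powerhouse of the cell."  # placeholder domain text

inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    # With labels supplied, the model returns the mean token-level cross-entropy.
    loss = model(**inputs, labels=inputs["input_ids"]).loss

perplexity = torch.exp(loss).item()
# Low values mean good next-token prediction, not necessarily true domain understanding.
print(f"perplexity = {perplexity:.2f}")
```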

I’m currently advancing research along two primary axes:

  1. Knowledge Measurement Pipeline: Developing transferable methodologies for measuring domain-specific knowledge across model layers, building on “Dissecting Recall of Factual Associations in Auto-Regressive Language Models” (see the layer-wise readout sketch after this list).

  2. Domain-Specific Pre-training: Investigating how specialized pre-training shapes model capabilities, extending insights from “Investigating Continual Pretraining in Large Language Models”.
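
To give a flavour of what “measuring knowledge across model layers” can look like, the sketch below is a simplified logit-lens-style readout, not the thesis pipeline itself. It assumes a GPT-2 checkpoint from Hugging Face, and the factual prompt and target token are purely illustrative; it asks at which layer the correct attribute token rises to the top of the model’s prediction.

```python
# Layer-wise readout sketch (assumptions: GPT-2 via Hugging Face transformers;
# prompt and target are illustrative; this is a simplified logit-lens probe).
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

prompt = "The Eiffel Tower is located in the city of"  # placeholder factual prompt
target = " Paris"                                      # expected attribute token
target_id = tokenizer.encode(target)[0]

inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# hidden_states: (embeddings, layer 1, ..., layer 12), each of shape [1, seq, d_model]
for layer, h in enumerate(out.hidden_states):
    # Project the last-position hidden state through the final norm + unembedding.
    logits = model.lm_head(model.transformer.ln_f(h[0, -1]))
    rank = (logits > logits[target_id]).sum().item() + 1
    print(f"layer {layer:2d}: rank of '{target.strip()}' = {rank}")
```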

A key innovation in our approach is the application of activation engineering through steering vectors – essentially creating interpretable maps of knowledge representation across model layers. This builds on Arditi et al.’s (2024) demonstration that LLMs encode features as linear directions in their activation space, complemented by Merullo et al.’s (2024) work on vector arithmetic in language models.
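
As an illustration of the general recipe rather than the exact method used in my thesis, the sketch below computes a steering vector as the difference of mean residual-stream activations between two contrastive prompt sets and adds it back at inference time via a forward hook. The model (GPT-2 via Hugging Face transformers), the prompt lists, and the chosen layer are all placeholder assumptions.

```python
# Steering-vector sketch (assumptions: GPT-2 via Hugging Face transformers,
# toy contrastive prompt lists, and a hand-picked layer; illustrative only).
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
LAYER = 6  # illustrative residual-stream layer

def mean_activation(prompts, layer):
    """Mean last-token hidden state at `layer` over a list of prompts."""
    acts = []
    for p in prompts:
        ids = tokenizer(p, return_tensors="pt")
        with torch.no_grad():
            h = model(**ids, output_hidden_states=True).hidden_states[layer]
        acts.append(h[0, -1])
    return torch.stack(acts).mean(dim=0)

# Contrastive prompt sets (placeholders for e.g. domain-specific vs. generic text).
pos = ["The enzyme catalyzes", "The protein binds to"]
neg = ["The weather today is", "My favourite movie is"]
steer = mean_activation(pos, LAYER) - mean_activation(neg, LAYER)

def add_steering(module, inputs, output):
    # GPT-2 blocks return a tuple; output[0] holds the hidden states.
    return (output[0] + steer,) + output[1:]

# Hook the block whose output corresponds to hidden_states[LAYER].
handle = model.transformer.h[LAYER - 1].register_forward_hook(add_steering)
out = model.generate(**tokenizer("The study of", return_tensors="pt"),
                     max_new_tokens=20)
handle.remove()
print(tokenizer.decode(out[0]))
```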

💡 Why Mechanistic Interpretability?

Richard Feynman’s principle, “What I cannot create, I do not understand,” perfectly encapsulates the mission of mechanistic interpretability. The field has achieved remarkable breakthroughs in demystifying neural networks, from Elhage et al.’s (2021) mathematical frameworks for transformer circuits to Elhage et al.’s (2022) understanding of superposition phenomena. Recent advances in automated circuit discovery (Conmy et al., 2023) and interpretable circuits in GPT-2 (Wang et al., 2022) demonstrate that systematic reverse engineering of these complex systems is not just possible – it’s revolutionizing our understanding.

🔬 Practical Applications

These theoretical advances have enabled significant practical progress, from localizing factual knowledge within model components to steering model behavior directly in activation space.

🎯 Research Focus

My research explores the crucial intersection of knowledge representation and model behavior. By combining activation engineering techniques with causal intervention methods, I investigate how models process and adapt domain-specific knowledge while maintaining their core capabilities. This work is particularly inspired by knowledge localization techniques (Meng et al., 2023), factual association tracking (Geva et al., 2023), and advances in model steering (Arditi et al., 2024).
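
As a concrete, if simplified, illustration of what a causal intervention looks like in this setting, the sketch below caches a hidden state from a “clean” prompt and patches it into a run on a “corrupted” prompt, then checks how much of the original prediction is restored. It is a toy version of the causal-tracing idea cited above, with GPT-2, the prompts, and the patched layer and position all chosen purely for illustration.

```python
# Activation-patching sketch (assumptions: GPT-2 via Hugging Face transformers,
# placeholder prompts, and a single hand-picked layer/position; a toy version
# of causal tracing, not a full reproduction of the cited work).
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
LAYER, POS = 6, -1  # illustrative layer index and token position to patch

clean = "The Eiffel Tower is located in the city of"   # placeholder clean prompt
corrupt = "The Colosseum is located in the city of"    # placeholder corrupted prompt
target_id = tokenizer.encode(" Paris")[0]

def logit_of_target(prompt, patch=None):
    """Return the target-token logit, optionally patching a cached hidden state."""
    handle = None
    if patch is not None:
        def hook(module, inputs, output):
            hidden = output[0].clone()
            hidden[0, POS] = patch  # overwrite one position with the cached state
            return (hidden,) + output[1:]
        handle = model.transformer.h[LAYER].register_forward_hook(hook)
    ids = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**ids, output_hidden_states=True)
    if handle is not None:
        handle.remove()
    # hidden_states[LAYER + 1] is the output of the hooked block.
    return out.logits[0, -1, target_id].item(), out.hidden_states[LAYER + 1][0, POS]

clean_logit, cached = logit_of_target(clean)
corrupt_logit, _ = logit_of_target(corrupt)
patched_logit, _ = logit_of_target(corrupt, patch=cached)
print(f"clean {clean_logit:.2f} | corrupt {corrupt_logit:.2f} | patched {patched_logit:.2f}")
```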

The most compelling aspect of this work is its potential to transform theoretical insights into practical applications, advancing both our understanding and our ability to create more reliable, controllable AI systems. Beyond the papers mentioned above, I draw significant inspiration from many other groundbreaking works in the field.